Individual Exercises: Extending our first pipeline: review and add

This report contains the results of the extended analysis in the RNAseq data reported in Brauns et al, 2022 (https://insight.jci.org/articles/view/154183) using and modifying the code provided by Prof. Gomez-Cabrero during the second class on January 30, 2024.

1. Reviewed the normalization and differential expression

Experimental design

Lets review experimental design from a practical perspective

# Read data from GEO
urld <- "https://www.ncbi.nlm.nih.gov/geo/download/?format=file&type=rnaseq_counts"
path <- paste(urld, "acc=GSE198256", "file=GSE198256_raw_counts_GRCh38.p13_NCBI.tsv.gz", sep="&");
GSE198256_count <- as.matrix(data.table::fread(path, header=T, colClasses="integer"), rownames=1)

# Read Meta data from GEO
library(GEOquery)
## Loading required package: Biobase
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
gds <- getGEO("GSE198256")
## Found 1 file(s)
## GSE198256_series_matrix.txt.gz
Meta_GSE198256 <- pData(gds$GSE198256_series_matrix.txt.gz@phenoData)
Group <- Meta_GSE198256[,c("disease state:ch1")]

#Check the size of our count matrix
dim(GSE198256_count)
## [1] 39376    34
#check the structure of our groups of comparison 
Group
##  [1] "Healthy"                  "Healthy"                 
##  [3] "Healthy"                  "Healthy"                 
##  [5] "Healthy"                  "Healthy"                 
##  [7] "Healthy"                  "Healthy"                 
##  [9] "Healthy"                  "Healthy"                 
## [11] "Healthy"                  "Covid19: Acute infection"
## [13] "Covid19: Acute infection" "Covid19: Acute infection"
## [15] "Covid19: Acute infection" "Covid19: Acute infection"
## [17] "Covid19: Acute infection" "Covid19: Acute infection"
## [19] "Covid19: Recovery 3Mo"    "Covid19: Recovery 3Mo"   
## [21] "Covid19: Recovery 3Mo"    "Covid19: Recovery 3Mo"   
## [23] "Covid19: Recovery 3Mo"    "Covid19: Recovery 3Mo"   
## [25] "Covid19: Recovery 6Mo"    "Covid19: Recovery 6Mo"   
## [27] "Covid19: Recovery 6Mo"    "Covid19: Recovery 6Mo"   
## [29] "Covid19: Recovery 6Mo"    "Covid19: Recovery 6Mo"   
## [31] "Covid19: Recovery 6Mo"    "Covid19: Recovery 6Mo"   
## [33] "Covid19: Recovery 6Mo"    "Covid19: Recovery 6Mo"

Limma: Normalize and set design

## Loading required package: limma
## 
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
## Loading required package: edgeR
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE

Limma Voom or Trend?

After normalization we need to define which is the better transformation for our data.

## Trend

# If the sequencing depth is reasonably consistent across the RNA samples, then the simplest and most robust approach to differential expression is to use limma-trend. This approach will usually work well if the ratio of the largest library size to the smallest is not more than about 3-fold.
logCPM <- cpm(dge, log=TRUE, prior.count=3)
 # The prior count is used here to damp down the variances of logarithms of low counts.
fit <- lmFit(logCPM, design)
fit <- eBayes(fit, trend=TRUE)
 # logical, should an intensity-dependent trend be allowed for the prior variance? If FALSE then the prior variance is constant. Alternatively, trend can be a row-wise numeric vector, which will be used as the covariate for the prior variance.
 # The use of eBayes or treat with trend=TRUE is known as the limma-trend method (Law et al, 2014; Phipson et al, 2016). With this option, an intensity-dependent trend is fitted to the prior variances s2.prior
topTable(fit, coef=ncol(design))
##               logFC    AveExpr         t      P.Value    adj.P.Val         B
## 29090      3.287737  1.4550637  7.827982 2.035253e-09 3.339647e-05 11.121251
## 2919      -3.427478  1.9415851 -5.894407 8.189137e-07 4.194087e-03  5.690910
## 727684    -2.868134 -1.1139954 -5.852319 9.348748e-07 4.194087e-03  5.569920
## 105379599 -2.518509 -0.8079827 -5.823881 1.022387e-06 4.194087e-03  5.488153
## 1435      -3.960957  0.1185299 -5.567817 2.287780e-06 4.263485e-03  4.751644
## 7013       1.420032  3.8470271  5.556059 2.373908e-06 4.263485e-03  4.717831
## 120534     1.482396  3.8701683  5.506442 2.774453e-06 4.263485e-03  4.575161
## 92935      2.590940  0.8298893  5.486884 2.950258e-06 4.263485e-03  4.518935
## 112935969  2.537765  0.3804905  5.486785 2.951173e-06 4.263485e-03  4.518651
## 59274     -1.718493  4.6857453 -5.480403 3.010933e-06 4.263485e-03  4.500304
## Voom

# When the library sizes are quite variable between samples, then the voom approach is theoretically more powerful than limma-trend.
v <- voom(dge, design, plot=TRUE)

v
## An object of class "EList"
## $targets
##            group lib.size norm.factors
## GSM5942339     1 12888012    0.9252685
## GSM5942340     1 12906182    0.8717521
## GSM5942341     1 16741955    0.8495083
## GSM5942342     1 36171318    1.1289978
## GSM5942343     1 15841677    0.9681553
## 29 more rows ...
## 
## $E
##           GSM5942339 GSM5942340 GSM5942341 GSM5942342 GSM5942343 GSM5942344
## 653635     2.5694299   3.149213  2.9175974  1.9159829   3.862970   2.383019
## 100996442  0.1700231   3.337916  2.4022094  2.0853206   3.469674   2.300082
## 729737     3.7256700   4.006977  3.8263876  4.0142849   4.685003   2.712984
## 102725121 -4.6879579  -2.368062  0.6625243 -0.5043489  -1.078763  -1.400357
## 102723897  3.1385906   2.761221  3.2788998  2.1315647   4.072338   2.521255
##           GSM5942345 GSM5942346 GSM5942347 GSM5942348 GSM5942349 GSM5942350
## 653635     3.2376749   2.647956   3.005527   3.053062   3.435758   2.821699
## 100996442  1.4333100   1.730418   2.192613   1.540249   3.914804   2.350001
## 729737     1.8946411   1.302997   2.340711   4.199505   3.938586   3.229534
## 102725121 -0.5877516  -1.439507  -1.142371  -3.817303  -1.773696  -2.827916
## 102723897  3.4997112   2.660953   2.882013   2.782610   3.457868   2.963898
##           GSM5942351 GSM5942352 GSM5942353 GSM5942354 GSM5942355 GSM5942356
## 653635     3.2645433   3.108424   2.998408   2.345331   3.097234  2.9613870
## 100996442  2.7275861   2.677789   2.137045   2.201661   2.182829  1.6812791
## 729737     4.4838331   4.559842   3.433060   4.287790   3.407119  3.3201804
## 102725121 -0.2464186  -1.108807  -1.739944  -1.468878  -2.119291 -0.1552222
## 102723897  3.5221438   3.292641   3.133263   2.641040   3.251000  2.9681602
##           GSM5942357 GSM5942358 GSM5942359 GSM5942360 GSM5942361 GSM5942362
## 653635      2.212244   2.544888   3.961393   2.710442  3.2381643  3.1721957
## 100996442   3.079687   2.877787   1.462845   2.190833  3.6547101  3.4545954
## 729737      4.209488   3.676372   2.694792   3.302425  4.7656577  4.2974944
## 102725121  -1.361748  -2.903573  -3.963420  -4.969038  0.9520073 -0.7571498
## 102723897   2.756249   2.477249   2.906945   2.361879  2.6682144  3.3000456
##           GSM5942363 GSM5942364 GSM5942365 GSM5942366 GSM5942367 GSM5942368
## 653635     3.8714573   3.388411  3.7337305   3.117049   3.576033   3.500154
## 100996442  3.5276618   2.415788  1.9975366   2.166925   2.561185   3.059863
## 729737     5.1875864   2.349190  2.1159781   2.599365   4.466084   4.145748
## 102725121 -0.4881631  -0.732604 -0.1918058  -2.934613  -1.159292  -1.841674
## 102723897  3.8123913   3.410354  3.5247836   3.187783   3.633942   3.269172
##           GSM5942369 GSM5942370 GSM5942371 GSM5942372
## 653635      2.387982  2.9731408   3.007646   3.071485
## 100996442   2.551702  2.7584841   1.470778   1.963844
## 729737      2.885739  4.5176513   3.086935   3.465148
## 102725121  -1.388742 -0.7009476  -1.000528  -1.015978
## 102723897   2.483798  2.9189487   3.176665   3.197801
## 16404 more rows ...
## 
## $weights
##           [,1]      [,2]      [,3]      [,4]     [,5]      [,6]      [,7]
## [1,] 1.6088153 1.6108296 2.0346929 3.8277382 1.936379 2.0797871 1.5962575
## [2,] 1.0713544 1.0725576 1.3301090 2.6326930 1.269007 1.3584116 1.0638489
## [3,] 2.0713941 2.0740077 2.6107430 4.5499602 2.490011 2.6666829 2.0551013
## [4,] 0.3159636 0.3160897 0.3409413 0.4352323 0.335424 0.3434468 0.3151749
## [5,] 1.7029317 1.7050944 2.1541225 3.9908141 2.049802 2.2025588 1.6894500
##           [,8]     [,9]     [,10]     [,11]     [,12]     [,13]     [,14]
## [1,] 2.1311204 1.525874 0.9721884 1.5065777 2.1675032 2.4912866 1.9942964
## [2,] 1.3905931 1.021868 0.6959994 1.0103782 1.4378057 1.6512870 1.3268040
## [3,] 2.7303015 1.964066 1.2177456 1.9384819 3.5844301 3.9894678 3.3465546
## [4,] 0.3462583 0.310679 0.2681120 0.3094226 0.3672072 0.3857036 0.3568433
## [5,] 2.2560409 1.614330 1.0209010 1.5938558 2.4034481 2.7542595 2.2139786
##          [,15]     [,16]     [,17]     [,18]     [,19]     [,20]     [,21]
## [1,] 2.5887275 2.4913843 2.8114657 3.1348733 1.8003714 1.4661422 1.0799422
## [2,] 1.7191512 1.6513550 1.8736001 2.1105265 1.6035525 1.3110639 0.9760765
## [3,] 4.1090420 3.9895890 4.3533698 4.6868077 3.0187638 2.4834525 1.7930218
## [4,] 0.3912807 0.3857092 0.4035633 0.4214937 0.2933262 0.2756959 0.2508461
## [5,] 2.8590316 2.7543703 3.0957002 3.4337656 1.5631297 1.2792819 0.9549237
##          [,22]     [,23]     [,24]     [,25]     [,26]     [,27]     [,28]
## [1,] 1.9749668 1.6314675 1.8267348 1.8501963 1.9257433 3.3085181 1.7869270
## [2,] 1.7579519 1.4552169 1.6268724 1.1332755 1.1751734 2.0423102 1.0982447
## [3,] 3.2751264 2.7560367 3.0599286 2.1209426 2.2076129 3.7008653 2.0483604
## [4,] 0.3017194 0.2847204 0.2946262 0.3418983 0.3464739 0.4229429 0.3379949
## [5,] 1.7127214 1.4190039 1.5857158 1.8488760 1.9243554 3.3064566 1.7856629
##          [,29]     [,30]     [,31]     [,32]     [,33]     [,34]
## [1,] 1.9007586 2.4245967 3.1799756 2.1124290 1.7218899 1.4544809
## [2,] 1.1613386 1.4636657 1.9518885 1.2809026 1.0626088 0.9181417
## [3,] 2.1789492 2.7668551 3.5683945 2.4186653 1.9745350 1.6640368
## [4,] 0.3449775 0.3750121 0.4159205 0.3574665 0.3339041 0.3160494
## [5,] 1.8993932 2.4228776 3.1780138 2.1109220 1.7206561 1.4534706
## 16404 more rows ...
## 
## $design
##   (Intercept) GroupCovid196Mo GroupCovid19AI GroupHealthy
## 1           1               0              0            1
## 2           1               0              0            1
## 3           1               0              0            1
## 4           1               0              0            1
## 5           1               0              0            1
## 29 more rows ...
# The voom method is similar in purpose to the limma-trend method, which uses eBayes or treat with trend=TRUE. The voom method incorporates the mean-variance trend into the precision weights, whereas limma-trend incorporates the trend into the empirical Bayes moderation. The voom method takes into account the sequencing depths (library sizes) of the individual columns of counts and applies the mean-variance trend on an individual observation basis. limma-trend, on the other hand, assumes that the library sizes are not wildly different and applies the mean-variance trend on a genewise basis. As noted by Law et al (2014), voom should be more powerful than limma-trend if the library sizes are very different but, otherwise, the two methods should give similar results.
fitV <- lmFit(v, design)
fitV <- eBayes(fitV)
topTable(fitV, coef=ncol(design))
##               logFC    AveExpr         t      P.Value   adj.P.Val        B
## 29090      4.161867  1.2216892  6.416136 1.684408e-07 0.002657002 6.030796
## 727684    -4.207169 -2.3402062 -6.206378 3.238469e-07 0.002657002 5.670244
## 2919      -3.749755  1.7717956 -5.832236 1.042316e-06 0.003642824 5.417822
## 5778      -2.157988  2.6630123 -5.720712 1.477187e-06 0.003642824 5.206465
## 101928173 -3.668782 -0.2985125 -5.808920 1.121143e-06 0.003642824 5.067868
## 105376281 -3.538534 -0.2886424 -5.704496 1.554011e-06 0.003642824 4.901121
## 105379599 -3.707908 -1.5669060 -5.924411 7.813501e-07 0.003642824 4.812158
## 123879    -1.672692  3.4958284 -5.523271 2.737879e-06 0.003945484 4.539344
## 59274     -1.727872  4.6755229 -5.527930 2.698330e-06 0.003945484 4.433349
## 57862     -1.339288  4.8662912 -5.506472 2.885356e-06 0.003945484 4.357617

ACTIVITY 1.1 :How would you compare the results between voom and trend?

To compare both methods, first I will plot the result of applying each transformation to the data. This is a good hint to verify weather the transformation did work or not.

#plotting voom
#ploting voom fit
plotSA(fitV, main="Final model: Voom transformation")

#plotting trend fit
plotSA(fit, main="Final model: Trend transformation")

The plots above were created using plotSA which plots log2 residual standard deviations against mean log-CPM values. In both plots, each black dot represents a gene and the average log2 residual standard deviation estimated by the empirical Bayes algorithm is marked by a horizontal blue line.From the graph we can also conclude that voom showed better fit than trend, since only with voom the trends are gone. This could be explain partially due to the fact that we have some library size discrepancies in the dataset, with some big libraries like GSM5942341 (14.1) and other almost three times smaller, like GSM5942348 (5.3)(see figure below, for more details please refer to noiseq analysis on report 1) then the power of voom outperformed trend. Other way to explore the effect of Voom vs Trend is to check into the differentially express genes obtained with each method.

#DEGs Limma-trend
summary(decideTests(fit))
##        (Intercept) GroupCovid196Mo GroupCovid19AI GroupHealthy
## Down          1649            1087           1944          649
## NotSig        3298           13635          11613        14810
## Up           11462            1687           2852          950
#DEGs Limma-Voom
summary(decideTests(fitV))        
##        (Intercept) GroupCovid196Mo GroupCovid19AI GroupHealthy
## Down          2107             937           1928          670
## NotSig        3435           14401          11585        15151
## Up           10867            1071           2896          588

We can observe that voom approach allowed the identification of similar number of degs.Now lets see how corelated are those DEGs based on their p-values.

a <- lm(fitV$p.value~fit$p.value)
#do the plot
plot(fit$p.value, fitV$p.value, main="scatter plot of p-values", xlab="Limma-trend", ylab = "Limma-voom")
#add line
abline(a, col="red")
## Warning in abline(a, col = "red"): only using the first two of 20 regression
## coefficients

#Show statistics
summary(a)
## Response (Intercept) :
## 
## Call:
## lm(formula = `(Intercept)` ~ fit$p.value)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57156 -0.02911 -0.02658 -0.02561  0.87289 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 0.025833   0.001893  13.646   <2e-16 ***
## fit$p.value(Intercept)      0.695629   0.005866 118.579   <2e-16 ***
## fit$p.valueGroupCovid196Mo -0.002959   0.004365  -0.678    0.498    
## fit$p.valueGroupCovid19AI   0.003778   0.004364   0.866    0.387    
## fit$p.valueGroupHealthy     0.004365   0.004243   1.029    0.304    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1489 on 16404 degrees of freedom
## Multiple R-squared:  0.4638, Adjusted R-squared:  0.4637 
## F-statistic:  3548 on 4 and 16404 DF,  p-value: < 2.2e-16
## 
## 
## Response GroupCovid196Mo :
## 
## Call:
## lm(formula = GroupCovid196Mo ~ fit$p.value)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.60181 -0.02945 -0.01409  0.03206  0.63203 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 0.020260   0.001286  15.754  < 2e-16 ***
## fit$p.value(Intercept)     -0.016535   0.003985  -4.149 3.35e-05 ***
## fit$p.valueGroupCovid196Mo  0.925702   0.002965 312.192  < 2e-16 ***
## fit$p.valueGroupCovid19AI   0.019829   0.002964   6.689 2.31e-11 ***
## fit$p.valueGroupHealthy     0.012409   0.002882   4.305 1.68e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1011 on 16404 degrees of freedom
## Multiple R-squared:  0.886,  Adjusted R-squared:  0.886 
## F-statistic: 3.189e+04 on 4 and 16404 DF,  p-value: < 2.2e-16
## 
## 
## Response GroupCovid19AI :
## 
## Call:
## lm(formula = GroupCovid19AI ~ fit$p.value)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53876 -0.01869 -0.00638  0.01689  0.56447 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 0.006463   0.001108   5.833 5.54e-09 ***
## fit$p.value(Intercept)      0.021661   0.003433   6.309 2.88e-10 ***
## fit$p.valueGroupCovid196Mo  0.015308   0.002555   5.992 2.11e-09 ***
## fit$p.valueGroupCovid19AI   0.945867   0.002554 370.388  < 2e-16 ***
## fit$p.valueGroupHealthy    -0.009608   0.002483  -3.869  0.00011 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08712 on 16404 degrees of freedom
## Multiple R-squared:  0.9115, Adjusted R-squared:  0.9115 
## F-statistic: 4.224e+04 on 4 and 16404 DF,  p-value: < 2.2e-16
## 
## 
## Response GroupHealthy :
## 
## Call:
## lm(formula = GroupHealthy ~ fit$p.value)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56685 -0.03878 -0.02008  0.03853  0.79389 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 3.108e-02  1.430e-03  21.726  < 2e-16 ***
## fit$p.value(Intercept)     -3.174e-05  4.432e-03  -0.007    0.994    
## fit$p.valueGroupCovid196Mo -1.707e-02  3.298e-03  -5.175 2.30e-07 ***
## fit$p.valueGroupCovid19AI   2.667e-02  3.297e-03   8.090 6.36e-16 ***
## fit$p.valueGroupHealthy     9.338e-01  3.206e-03 291.283  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1125 on 16404 degrees of freedom
## Multiple R-squared:  0.8683, Adjusted R-squared:  0.8683 
## F-statistic: 2.704e+04 on 4 and 16404 DF,  p-value: < 2.2e-16

From the previuos plot and statistics (specially \(R^{2}adj\)) we can observe a high level of correlation (\(R^{2}adj > 0.8\)) for all the contrast except the intercept term. Conclusion, voom fit performed slighly better than trend but we can trust the DEGs from both approaches since are representing the most relevant/strong differential expressed.

ACTIVITY 1.2: What exactly are we asking with this differential expression?

With the current design: (\(design <- model.matrix(~ Group )\)) we are using a design matrix with an intercept term.If we see our design matrix we can observe that it has the same number of columns as the factor Group has levels.The design matrix contains a column of 1s (the intercept term), and a column with values as 0s or 1s (a value of 1 when the associated sample is in the group specified in the column (e.g. GroupCovid196Mo) and 0 otherwise).

(Intercept) GroupCovid196Mo GroupCovid19AI GroupHealthy
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1

This design matrix is parameterised for a mean-reference model, where the intercept term in the first column is parameterised for the mean expression of the Covid193Mo,the second column is parameterised for the mean expression of the Covid196Mo group relative to Covid193Mo, the third column is parameterised for the mean expression of the Covid19AI group relative to Covid193Mo and the fourth column is parameterised for the mean expression of the Healthy group relative to Covid193Mo. The Covid193Mo group is selected as the reference level since is the first level in group when ordered alphanumerically. This is because, the levels in a factor are ordered alphanumerically by default. Conclusion, with the current design and levels distribution we are looking into gene expression differences between Covid196Mo/Covid19AI/Healthy against the Covid193Mo group.

ACTIVITY 1.3: Is it required to run more analysis?

Yes, with the current configuration of levels I can not see the contrast as I expect. For reproducibility I expect to produce the same contrasts as in the original publication.In the paper the authors mentioned they performed the DE analysis by comparing the different groups of COVID19 patients with the healthy controls.Since the reference level by default does not correspond with my real reference, I will do a re-specification of the reference level using the relevel function, to define the healthy status as the reference for the contrasts.

#Convert Group in a factor and relevel to make Healthy our reference
Group2 <- relevel(factor(Group), ref = "Healthy")
#Create our new design matrix
design2 <- model.matrix(~ Group2)
design2
##    (Intercept) Group2Covid193Mo Group2Covid196Mo Group2Covid19AI
## 1            1                0                0               0
## 2            1                0                0               0
## 3            1                0                0               0
## 4            1                0                0               0
## 5            1                0                0               0
## 6            1                0                0               0
## 7            1                0                0               0
## 8            1                0                0               0
## 9            1                0                0               0
## 10           1                0                0               0
## 11           1                0                0               0
## 12           1                0                0               1
## 13           1                0                0               1
## 14           1                0                0               1
## 15           1                0                0               1
## 16           1                0                0               1
## 17           1                0                0               1
## 18           1                0                0               1
## 19           1                1                0               0
## 20           1                1                0               0
## 21           1                1                0               0
## 22           1                1                0               0
## 23           1                1                0               0
## 24           1                1                0               0
## 25           1                0                1               0
## 26           1                0                1               0
## 27           1                0                1               0
## 28           1                0                1               0
## 29           1                0                1               0
## 30           1                0                1               0
## 31           1                0                1               0
## 32           1                0                1               0
## 33           1                0                1               0
## 34           1                0                1               0
## attr(,"assign")
## [1] 0 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$Group2
## [1] "contr.treatment"

Now that we have our design matrix as desired lets calculate the differential expression for the expected contrasts.

#fit to the new design
v2 <- voom(dge, design2, plot=TRUE)

v2
## An object of class "EList"
## $targets
##            group lib.size norm.factors
## GSM5942339     1 12888012    0.9252685
## GSM5942340     1 12906182    0.8717521
## GSM5942341     1 16741955    0.8495083
## GSM5942342     1 36171318    1.1289978
## GSM5942343     1 15841677    0.9681553
## 29 more rows ...
## 
## $E
##           GSM5942339 GSM5942340 GSM5942341 GSM5942342 GSM5942343 GSM5942344
## 653635     2.5694299   3.149213  2.9175974  1.9159829   3.862970   2.383019
## 100996442  0.1700231   3.337916  2.4022094  2.0853206   3.469674   2.300082
## 729737     3.7256700   4.006977  3.8263876  4.0142849   4.685003   2.712984
## 102725121 -4.6879579  -2.368062  0.6625243 -0.5043489  -1.078763  -1.400357
## 102723897  3.1385906   2.761221  3.2788998  2.1315647   4.072338   2.521255
##           GSM5942345 GSM5942346 GSM5942347 GSM5942348 GSM5942349 GSM5942350
## 653635     3.2376749   2.647956   3.005527   3.053062   3.435758   2.821699
## 100996442  1.4333100   1.730418   2.192613   1.540249   3.914804   2.350001
## 729737     1.8946411   1.302997   2.340711   4.199505   3.938586   3.229534
## 102725121 -0.5877516  -1.439507  -1.142371  -3.817303  -1.773696  -2.827916
## 102723897  3.4997112   2.660953   2.882013   2.782610   3.457868   2.963898
##           GSM5942351 GSM5942352 GSM5942353 GSM5942354 GSM5942355 GSM5942356
## 653635     3.2645433   3.108424   2.998408   2.345331   3.097234  2.9613870
## 100996442  2.7275861   2.677789   2.137045   2.201661   2.182829  1.6812791
## 729737     4.4838331   4.559842   3.433060   4.287790   3.407119  3.3201804
## 102725121 -0.2464186  -1.108807  -1.739944  -1.468878  -2.119291 -0.1552222
## 102723897  3.5221438   3.292641   3.133263   2.641040   3.251000  2.9681602
##           GSM5942357 GSM5942358 GSM5942359 GSM5942360 GSM5942361 GSM5942362
## 653635      2.212244   2.544888   3.961393   2.710442  3.2381643  3.1721957
## 100996442   3.079687   2.877787   1.462845   2.190833  3.6547101  3.4545954
## 729737      4.209488   3.676372   2.694792   3.302425  4.7656577  4.2974944
## 102725121  -1.361748  -2.903573  -3.963420  -4.969038  0.9520073 -0.7571498
## 102723897   2.756249   2.477249   2.906945   2.361879  2.6682144  3.3000456
##           GSM5942363 GSM5942364 GSM5942365 GSM5942366 GSM5942367 GSM5942368
## 653635     3.8714573   3.388411  3.7337305   3.117049   3.576033   3.500154
## 100996442  3.5276618   2.415788  1.9975366   2.166925   2.561185   3.059863
## 729737     5.1875864   2.349190  2.1159781   2.599365   4.466084   4.145748
## 102725121 -0.4881631  -0.732604 -0.1918058  -2.934613  -1.159292  -1.841674
## 102723897  3.8123913   3.410354  3.5247836   3.187783   3.633942   3.269172
##           GSM5942369 GSM5942370 GSM5942371 GSM5942372
## 653635      2.387982  2.9731408   3.007646   3.071485
## 100996442   2.551702  2.7584841   1.470778   1.963844
## 729737      2.885739  4.5176513   3.086935   3.465148
## 102725121  -1.388742 -0.7009476  -1.000528  -1.015978
## 102723897   2.483798  2.9189487   3.176665   3.197801
## 16404 more rows ...
## 
## $weights
##           [,1]      [,2]      [,3]      [,4]     [,5]      [,6]      [,7]
## [1,] 1.6088153 1.6108296 2.0346929 3.8277382 1.936379 2.0797871 1.5962575
## [2,] 1.0713544 1.0725576 1.3301090 2.6326930 1.269007 1.3584116 1.0638489
## [3,] 2.0713941 2.0740077 2.6107430 4.5499602 2.490011 2.6666829 2.0551013
## [4,] 0.3159636 0.3160897 0.3409413 0.4352323 0.335424 0.3434468 0.3151749
## [5,] 1.7029317 1.7050944 2.1541225 3.9908141 2.049802 2.2025588 1.6894500
##           [,8]     [,9]     [,10]     [,11]     [,12]     [,13]     [,14]
## [1,] 2.1311204 1.525874 0.9721884 1.5065777 2.1675032 2.4912866 1.9942964
## [2,] 1.3905931 1.021868 0.6959994 1.0103782 1.4378057 1.6512870 1.3268040
## [3,] 2.7303015 1.964066 1.2177456 1.9384819 3.5844301 3.9894678 3.3465546
## [4,] 0.3462583 0.310679 0.2681120 0.3094226 0.3672072 0.3857036 0.3568433
## [5,] 2.2560409 1.614330 1.0209010 1.5938558 2.4034481 2.7542595 2.2139786
##          [,15]     [,16]     [,17]     [,18]     [,19]     [,20]     [,21]
## [1,] 2.5887275 2.4913843 2.8114657 3.1348733 1.8003714 1.4661422 1.0799422
## [2,] 1.7191512 1.6513550 1.8736001 2.1105265 1.6035525 1.3110639 0.9760765
## [3,] 4.1090420 3.9895890 4.3533698 4.6868077 3.0187638 2.4834525 1.7930218
## [4,] 0.3912807 0.3857092 0.4035633 0.4214937 0.2933262 0.2756959 0.2508461
## [5,] 2.8590316 2.7543703 3.0957002 3.4337656 1.5631297 1.2792819 0.9549237
##          [,22]     [,23]     [,24]     [,25]     [,26]     [,27]     [,28]
## [1,] 1.9749668 1.6314675 1.8267348 1.8501963 1.9257433 3.3085181 1.7869270
## [2,] 1.7579519 1.4552169 1.6268724 1.1332755 1.1751734 2.0423102 1.0982447
## [3,] 3.2751264 2.7560367 3.0599286 2.1209426 2.2076129 3.7008653 2.0483604
## [4,] 0.3017194 0.2847204 0.2946262 0.3418983 0.3464739 0.4229429 0.3379949
## [5,] 1.7127214 1.4190039 1.5857158 1.8488760 1.9243554 3.3064566 1.7856629
##          [,29]     [,30]     [,31]     [,32]     [,33]     [,34]
## [1,] 1.9007586 2.4245967 3.1799756 2.1124290 1.7218899 1.4544809
## [2,] 1.1613386 1.4636657 1.9518885 1.2809026 1.0626088 0.9181417
## [3,] 2.1789492 2.7668551 3.5683945 2.4186653 1.9745350 1.6640368
## [4,] 0.3449775 0.3750121 0.4159205 0.3574665 0.3339041 0.3160494
## [5,] 1.8993932 2.4228776 3.1780138 2.1109220 1.7206561 1.4534706
## 16404 more rows ...
## 
## $design
##   (Intercept) Group2Covid193Mo Group2Covid196Mo Group2Covid19AI
## 1           1                0                0               0
## 2           1                0                0               0
## 3           1                0                0               0
## 4           1                0                0               0
## 5           1                0                0               0
## 29 more rows ...
fitV2 <- lmFit(v2, design2)
fitV2 <- eBayes(fitV2)
plotSA(fitV2, main="Final model: Voom transformation")

#We can check all the degs directly 
summary(decideTests(fitV2))
##        (Intercept) Group2Covid193Mo Group2Covid196Mo Group2Covid19AI
## Down          2274              588                0             661
## NotSig        2586            15151            16409           14737
## Up           11549              670                0            1011
# or one by one
topTable(fitV2) 
## Removing intercept from test coefficients
##       Group2Covid193Mo Group2Covid196Mo Group2Covid19AI    AveExpr        F
## 9509        0.43753967       -1.5508977        8.105772 -0.2386102 39.54145
## 7045        0.08140918        0.1537577       -1.612767  8.2375143 31.29399
## 11326      -0.31093056       -0.3313694        2.611574  3.5989265 30.77581
## 920         0.32080239        0.2248797       -2.126645  7.2546704 29.33666
## 2322        0.06116568       -0.3253756        3.128272  4.3439994 29.05401
## 7079        0.99598151        0.7825211        5.852175 -3.2470195 28.25468
## 55076       1.16209965        0.7636346        6.573700 -2.3285550 28.10334
## 55022       0.80182133       -0.2341417       -3.974775  6.5885428 27.60392
## 57118       0.20757069        0.3014643       -2.453259  4.7869364 27.57466
## 79071      -1.35919516        0.3657930        4.231485 -1.7389007 26.69860
##            P.Value    adj.P.Val
## 9509  1.181389e-11 1.938541e-07
## 7045  2.858339e-10 1.947326e-06
## 11326 3.560228e-10 1.947326e-06
## 920   6.643212e-10 2.470437e-06
## 2322  7.527690e-10 2.470437e-06
## 7079  1.076901e-09 2.677346e-06
## 55076 1.153341e-09 2.677346e-06
## 55022 1.448854e-09 2.677346e-06
## 57118 1.468469e-09 2.677346e-06
## 79071 2.206723e-09 3.353851e-06
topTable(fitV2,coef=1) #Covid19AI-Healthy
##           logFC   AveExpr        t      P.Value    adj.P.Val        B
## 23196  7.847252  7.855804 138.9305 4.219937e-52 6.924495e-48 105.8803
## 2219  10.493791 10.453941 135.7340 1.000941e-51 8.212220e-48 105.2289
## 9555   8.103740  8.252498 133.7294 1.738439e-51 8.332178e-48 104.8102
## 4478   9.442048  9.547212 131.7222 3.046686e-51 8.332178e-48 104.4237
## 1520  11.828031 11.713498 132.6107 2.374157e-51 8.332178e-48 104.3860
## 79026 12.085459 11.910560 131.9700 2.841510e-51 8.332178e-48 104.2095
## 8826   9.907703  9.849239 130.3115 4.542674e-51 1.064868e-47 104.0726
## 3936  10.510217 10.648310 128.7914 7.020214e-51 1.439934e-47 103.6675
## 10618  8.360382  8.380979 128.0219 8.767872e-51 1.598578e-47 103.5388
## 8301   8.802993  8.648682 127.4487 1.035600e-50 1.699316e-47 103.4339
topTable(fitV2,coef=2) #Covid193Mo-Healthy
##               logFC    AveExpr         t      P.Value   adj.P.Val        B
## 29090     -4.161867  1.2216892 -6.416136 1.684408e-07 0.002657002 6.030796
## 727684     4.207169 -2.3402062  6.206378 3.238469e-07 0.002657002 5.670244
## 2919       3.749755  1.7717956  5.832236 1.042316e-06 0.003642824 5.417822
## 5778       2.157988  2.6630123  5.720712 1.477187e-06 0.003642824 5.206465
## 101928173  3.668782 -0.2985125  5.808920 1.121143e-06 0.003642824 5.067868
## 105376281  3.538534 -0.2886424  5.704496 1.554011e-06 0.003642824 4.901121
## 105379599  3.707908 -1.5669060  5.924411 7.813501e-07 0.003642824 4.812158
## 123879     1.672692  3.4958284  5.523271 2.737879e-06 0.003945484 4.539344
## 59274      1.727872  4.6755229  5.527930 2.698330e-06 0.003945484 4.433349
## 57862      1.339288  4.8662912  5.506472 2.885356e-06 0.003945484 4.357617
topTable(fitV2,coef=3) #Covid196Mo-Healthy
##               logFC    AveExpr        t      P.Value  adj.P.Val         B
## 3008      1.1618793  7.8027241 5.024076 1.295072e-05 0.08678802 2.8316894
## 8971      1.0875381  6.4957990 4.954021 1.608666e-05 0.08678802 2.6637955
## 105372599 2.8142572 -2.2719391 5.084797 1.072837e-05 0.08678802 2.3547090
## 100188987 3.2110970 -1.2120058 4.801433 2.575850e-05 0.08678802 1.8662731
## 106780802 2.4930949 -2.5931448 4.792882 2.644525e-05 0.08678802 1.6405003
## 1408      0.9498234  4.0373516 4.527113 5.965672e-05 0.12421359 1.6140483
## 8349      0.8532691  7.0057546 4.485882 6.762493e-05 0.12421359 1.2442540
## 389741    2.9701320 -0.9085315 4.472342 7.046350e-05 0.12421359 1.1852787
## 6676      3.0133831 -0.3789289 4.410286 8.504493e-05 0.12421359 1.1553350
## 254100    2.9278218 -1.6397971 4.428505 8.048113e-05 0.12421359 0.9644804

Other option to check our degs is to look into specific contrasts, this would be specially useful if we want to make comparisons among groups other than the comparison against the healthy group for example to compare acute covid19 patients vs late recovery patients we shoud specify our contrast as: Group2Covid19AI - Group2Covid196Mo. For this experiment, since we are interested in comparisons against healthy individual and we already selected healthy as the reference for the contrast we just need to specify the non-reference group.

contrast.matrix <- makeContrasts(Group2Covid19AI, Group2Covid193Mo, 
                                 Group2Covid196Mo,    
                                 levels=design2)
## Warning in makeContrasts(Group2Covid19AI, Group2Covid193Mo, Group2Covid196Mo, :
## Renaming (Intercept) to Intercept
fit3 <- contrasts.fit(fitV2, contrast.matrix)
## Warning in contrasts.fit(fitV2, contrast.matrix): row names of contrasts don't
## match col names of coefficients
fit3 <- eBayes(fit3)
topTable(fit3) 
##       Group2Covid19AI Group2Covid193Mo Group2Covid196Mo    AveExpr        F
## 9509         8.105772       0.43753967       -1.5508977 -0.2386102 39.54145
## 7045        -1.612767       0.08140918        0.1537577  8.2375143 31.29399
## 11326        2.611574      -0.31093056       -0.3313694  3.5989265 30.77581
## 920         -2.126645       0.32080239        0.2248797  7.2546704 29.33666
## 2322         3.128272       0.06116568       -0.3253756  4.3439994 29.05401
## 7079         5.852175       0.99598151        0.7825211 -3.2470195 28.25468
## 55076        6.573700       1.16209965        0.7636346 -2.3285550 28.10334
## 55022       -3.974775       0.80182133       -0.2341417  6.5885428 27.60392
## 57118       -2.453259       0.20757069        0.3014643  4.7869364 27.57466
## 79071        4.231485      -1.35919516        0.3657930 -1.7389007 26.69860
##            P.Value    adj.P.Val
## 9509  1.181389e-11 1.938541e-07
## 7045  2.858339e-10 1.947326e-06
## 11326 3.560228e-10 1.947326e-06
## 920   6.643212e-10 2.470437e-06
## 2322  7.527690e-10 2.470437e-06
## 7079  1.076901e-09 2.677346e-06
## 55076 1.153341e-09 2.677346e-06
## 55022 1.448854e-09 2.677346e-06
## 57118 1.468469e-09 2.677346e-06
## 79071 2.206723e-09 3.353851e-06
topTable(fit3,coef=1) #Covid19AI-Healthy
##               logFC    AveExpr         t      P.Value    adj.P.Val        B
## 9509       8.105772 -0.2386102  9.151225 4.648245e-11 7.627305e-07 14.43235
## 55076      6.573700 -2.3285550  8.482649 3.201407e-10 1.751063e-06 12.11434
## 7045      -1.612767  8.2375143 -8.083755 1.038505e-09 3.716100e-06 11.96217
## 7079       5.852175 -3.2470195  8.568670 2.489933e-10 1.751063e-06 11.92837
## 11326      2.611574  3.5989265  7.982045 1.405949e-09 3.716100e-06 11.84199
## 2322       3.128272  4.3439994  7.951867 1.538509e-09 3.716100e-06 11.71199
## 102724008  5.482236 -0.1681813  7.854528 2.058741e-09 4.222736e-06 11.07869
## 4128       5.933255 -1.2630316  7.941846 1.585270e-09 3.716100e-06 11.05461
## 241        2.033411  5.2271701  7.567324 4.890415e-09 8.024682e-06 10.50831
## 920       -2.126645  7.2546704 -7.573392 4.801412e-09 8.024682e-06 10.47407
topTable(fit3,coef=2) #Covid193Mo-Healthy
##               logFC    AveExpr         t      P.Value   adj.P.Val        B
## 29090     -4.161867  1.2216892 -6.416136 1.684408e-07 0.002657002 6.030796
## 727684     4.207169 -2.3402062  6.206378 3.238469e-07 0.002657002 5.670244
## 2919       3.749755  1.7717956  5.832236 1.042316e-06 0.003642824 5.417822
## 5778       2.157988  2.6630123  5.720712 1.477187e-06 0.003642824 5.206465
## 101928173  3.668782 -0.2985125  5.808920 1.121143e-06 0.003642824 5.067868
## 105376281  3.538534 -0.2886424  5.704496 1.554011e-06 0.003642824 4.901121
## 105379599  3.707908 -1.5669060  5.924411 7.813501e-07 0.003642824 4.812158
## 123879     1.672692  3.4958284  5.523271 2.737879e-06 0.003945484 4.539344
## 59274      1.727872  4.6755229  5.527930 2.698330e-06 0.003945484 4.433349
## 57862      1.339288  4.8662912  5.506472 2.885356e-06 0.003945484 4.357617
topTable(fit3,coef=3) #Covid196Mo-Healthy
##               logFC    AveExpr        t      P.Value  adj.P.Val         B
## 3008      1.1618793  7.8027241 5.024076 1.295072e-05 0.08678802 2.8316894
## 8971      1.0875381  6.4957990 4.954021 1.608666e-05 0.08678802 2.6637955
## 105372599 2.8142572 -2.2719391 5.084797 1.072837e-05 0.08678802 2.3547090
## 100188987 3.2110970 -1.2120058 4.801433 2.575850e-05 0.08678802 1.8662731
## 106780802 2.4930949 -2.5931448 4.792882 2.644525e-05 0.08678802 1.6405003
## 1408      0.9498234  4.0373516 4.527113 5.965672e-05 0.12421359 1.6140483
## 8349      0.8532691  7.0057546 4.485882 6.762493e-05 0.12421359 1.2442540
## 389741    2.9701320 -0.9085315 4.472342 7.046350e-05 0.12421359 1.1852787
## 6676      3.0133831 -0.3789289 4.410286 8.504493e-05 0.12421359 1.1553350
## 254100    2.9278218 -1.6397971 4.428505 8.048113e-05 0.12421359 0.9644804

Now if we check into the TopTable for each contrast obtained in the two previous chuncks we can see we got exactly the same results (the order varies a little but values are the same), so this are two ways of extract the information from our experiment. Now, going back to big numbers, if we look into the degs of our different contrasts summarized in the table below, we can observe that our trends correlate with the reported in the original manuscript at least for the contrast Late recovery vs Control (a.k.a Group2Covid196Mo) where the authors observed a lack of DEGs.

Intercept Group2Covid193Mo Group2Covid196Mo Group2Covid19AI
Down 2274 588 0 661
NotSig 2586 15151 16409 14737
Up 11549 670 0 1011

But if we want to properly estimate weather the number of degs are comparable or not with the original manuscript we need to filter our DEGs according to the significant thresholds used on the paper: FDR<0.5 and |FC| > 2.( Notice, A FC of 2 correspond to a log2fc = 1, since log2(2) = 1). The FDR threshold correspond to the default value on Limma, but for the LogFC we can apply The treat method (McCarthy and Smyth 2009) to calculate p-values from empirical Bayes moderated t-statistics with a minimum log-FC requirement.

tfit <- treat(fitV2, lfc=1)
dt <- decideTests(tfit)
summary(dt)
##        (Intercept) Group2Covid193Mo Group2Covid196Mo Group2Covid19AI
## Down           995                0                0               8
## NotSig        4974            16409            16409           16332
## Up           10440                0                0              69

From the summary we can observe that after applying a LFC cuttoff of 1 we loss the majority of DEGs from our analysis. Then this result is not comparable to the one obtained by the original authors using DESeq2, where a high number of DEGs wrere detected in two of the three contrast. However, this behavior is in part expected, since literature reports a higher number of detected DEGs using DESeq2 in comparison to Limma (Tong, 2021). This highlights the importance of selecting the appropriate tools to interrogate our data to extract the most relevant and reliable information.

ACTIVITY 2.Plan the next Analysis

The goal of differential expression analysis is to determine which genes are expressed at different levels between conditions, in our case between COVID19 patients at different stages of recovery and healthy controls. These genes can offer biological insight into the processes affected by the conditions of interest.So, to gain biological insight into the differential expression results we need to perform functional analysis on the detected differential express genes (DEGs). For this aim, we should explore and collect databases of our interest.Since, to explore the biological context in which our DEGs are involved we need to know first which genes from the global genome are associated with each biological phenomenon. Among the most whidely used online databases for functional analysis are included:
* Gene ontology: stores genes aassociated with particular biological processes, cellular components and molecular functions. * Biological pathways: stores genes involved in particular biological pathways, among the most popular are KEGG and Reactome * Others: there is almost one database for each biological question. for example, there are databases for genes associates to specific binding motifs, perturbation, epigenetic marks (e.g MSigDB), phenotypes (e.g Human phenotype ontology), and diseases (e.g OMIM)

In addition to the databases of our interest we need to define the techniques to use for the functional analysis. There are many approaches for this analysis we will focus in two methodologies:

  • Over-representation analysis: Compares the proportion of genes associated with a particular process/pathway in the list of differentially expressed genes to the proportion of genes associated with that pathway in a background list (genes tested for DE).Producing as a result processes/pathways related to genes exhibiting larger changes in expression between conditions (Piper,2023)

  • GSEA: is a functional class scoring method. particularly helpful when there are few DEGs or when processes have genes exhibiting weaker buy coordinated expression changes.The GSEA method uses the log2 fold changes for all genes from the differential expression results to determine whether any biological pathways are enriched among the genes with positive or negative fold changes. (Piper,2023)

ACTIVITY 3. Gene Set Analysis: ORA and GSEA

We start by preparing our data in the correct format for the ORA and GSEA analysis. First by converting our gene ids to the ENSEMBL format which is the required for the databases we are going to use for the functional annotation

library(clusterProfiler)
## 
## Registered S3 methods overwritten by 'treeio':
##   method              from    
##   MRCA.phylo          tidytree
##   MRCA.treedata       tidytree
##   Nnode.treedata      tidytree
##   Ntip.treedata       tidytree
##   ancestor.phylo      tidytree
##   ancestor.treedata   tidytree
##   child.phylo         tidytree
##   child.treedata      tidytree
##   full_join.phylo     tidytree
##   full_join.treedata  tidytree
##   groupClade.phylo    tidytree
##   groupClade.treedata tidytree
##   groupOTU.phylo      tidytree
##   groupOTU.treedata   tidytree
##   inner_join.phylo    tidytree
##   inner_join.treedata tidytree
##   is.rooted.treedata  tidytree
##   nodeid.phylo        tidytree
##   nodeid.treedata     tidytree
##   nodelab.phylo       tidytree
##   nodelab.treedata    tidytree
##   offspring.phylo     tidytree
##   offspring.treedata  tidytree
##   parent.phylo        tidytree
##   parent.treedata     tidytree
##   root.treedata       tidytree
##   rootnode.phylo      tidytree
##   sibling.phylo       tidytree
## clusterProfiler v4.8.3  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
## 
## If you use clusterProfiler in published research, please cite:
## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
## 
## Attaching package: 'clusterProfiler'
## The following object is masked from 'package:stats':
## 
##     filter
library(msigdbr)
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:clusterProfiler':
## 
##     rename
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:clusterProfiler':
## 
##     slice
## 
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:clusterProfiler':
## 
##     select
## 
library(magrittr)

keytypes(org.Hs.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIPROT"
# If we want to shift annotations:
ENSEMBL_vector <- mapIds(
  # Replace with annotation package for the organism relevant to your data
  org.Hs.eg.db,
  # The vector of gene identifiers we want to map
  keys = rownames(GSE198256_count),
  # Replace with the type of gene identifiers in your data
  keytype = "ENTREZID",
  # Replace with the type of gene identifiers you would like to map to
  column = "ENSEMBL",
  # In the case of 1:many mappings, return the
  # first one. This is default behavior!
  multiVals = "first"
)
## 'select()' returned 1:many mapping between keys and columns
# We would like a data frame we can join to the differential expression stats
gene_key_df <- data.frame(
  ensembl_id = ENSEMBL_vector,
  entrez_id = names(ENSEMBL_vector),
  stringsAsFactors = FALSE
) %>%
  # If an Ensembl gene identifier doesn't map to a gene symbol, drop that
  # from the data frame
  dplyr::filter(!is.na(ensembl_id))

Now, lets conduct ORA analysis

# Step 1: determine genes of interest.since coef=1 we are looking at the contrast Covid19AI-Healthy
diff_table <- topTable(fit3,coef=1,p.value=0.01,number=10000) 
genes_dif<- rownames(diff_table)

# Step 2: determine background.

background_set <- unique(rownames(logCPM))

# Step 3: Determine gene sets (KEGG).

msigdbr_species()
## # A tibble: 20 × 2
##    species_name                    species_common_name                          
##    <chr>                           <chr>                                        
##  1 Anolis carolinensis             Carolina anole, green anole                  
##  2 Bos taurus                      bovine, cattle, cow, dairy cow, domestic cat…
##  3 Caenorhabditis elegans          <NA>                                         
##  4 Canis lupus familiaris          dog, dogs                                    
##  5 Danio rerio                     leopard danio, zebra danio, zebra fish, zebr…
##  6 Drosophila melanogaster         fruit fly                                    
##  7 Equus caballus                  domestic horse, equine, horse                
##  8 Felis catus                     cat, cats, domestic cat                      
##  9 Gallus gallus                   bantam, chicken, chickens, Gallus domesticus 
## 10 Homo sapiens                    human                                        
## 11 Macaca mulatta                  rhesus macaque, rhesus macaques, Rhesus monk…
## 12 Monodelphis domestica           gray short-tailed opossum                    
## 13 Mus musculus                    house mouse, mouse                           
## 14 Ornithorhynchus anatinus        duck-billed platypus, duckbill platypus, pla…
## 15 Pan troglodytes                 chimpanzee                                   
## 16 Rattus norvegicus               brown rat, Norway rat, rat, rats             
## 17 Saccharomyces cerevisiae        baker's yeast, brewer's yeast, S. cerevisiae 
## 18 Schizosaccharomyces pombe 972h- <NA>                                         
## 19 Sus scrofa                      pig, pigs, swine, wild boar                  
## 20 Xenopus tropicalis              tropical clawed frog, western clawed frog
hs_msigdb_df <- msigdbr(species = "Homo sapiens")
head(hs_msigdb_df)
## # A tibble: 6 × 15
##   gs_cat gs_subcat      gs_name        gene_symbol entrez_gene ensembl_gene   
##   <chr>  <chr>          <chr>          <chr>             <int> <chr>          
## 1 C3     MIR:MIR_Legacy AAACCAC_MIR140 ABCC4             10257 ENSG00000125257
## 2 C3     MIR:MIR_Legacy AAACCAC_MIR140 ABRAXAS2          23172 ENSG00000165660
## 3 C3     MIR:MIR_Legacy AAACCAC_MIR140 ACTN4                81 ENSG00000130402
## 4 C3     MIR:MIR_Legacy AAACCAC_MIR140 ACTN4                81 ENSG00000282844
## 5 C3     MIR:MIR_Legacy AAACCAC_MIR140 ACVR1                90 ENSG00000115170
## 6 C3     MIR:MIR_Legacy AAACCAC_MIR140 ADAM9              8754 ENSG00000168615
## # ℹ 9 more variables: human_gene_symbol <chr>, human_entrez_gene <int>,
## #   human_ensembl_gene <chr>, gs_id <chr>, gs_pmid <chr>, gs_geoid <chr>,
## #   gs_exact_source <chr>, gs_url <chr>, gs_description <chr>
hs_kegg_df <- hs_msigdb_df %>%
  dplyr::filter(
    gs_cat == "C2", # This is to filter only to the C2 curated gene sets
    gs_subcat == "CP:KEGG" # This is because we only want KEGG pathways
  )

# Step 4: conduct ORA.

kegg_ora_results_c1 <- enricher(
  gene = genes_dif, # A vector of your genes of interest
  pvalueCutoff = 0.1, # Can choose a FDR cutoff
  pAdjustMethod = "BH", # Method to be used for multiple testing correction
  universe = background_set, # A vector containing your background set genes
  # The pathway information should be a data frame with a term name or
  # identifier and the gene identifiers
  TERM2GENE = dplyr::select(
    hs_kegg_df,
    gs_name,
    human_entrez_gene
  )
)

# Step 5: Visualize / explore

enrich_plot_kegg_c1 <- enrichplot::dotplot(kegg_ora_results_c1)
enrich_plot_kegg_c1

upset_plot_kegg_c1 <- enrichplot::upsetplot(kegg_ora_results_c1)
upset_plot_kegg_c1

ACTIVITY 3.1: alternatives to KEGG?

KEGG database allow us to identify particular biological pathways in which our DEGs could be involved, in addition to the pathway functional annotation the other very popular database is Gene ontology (GO), So lets run ORA on GO to establish if our DEGs are associated with particular biological processes.

# Step 1:   I will keep the same as before. coef=1 we are looking at the contrast Covid19AI-Healthy
# Step 2: determine background. I will keep the same

# Step 3: Determine gene sets (GO).

msigdbr_species()
## # A tibble: 20 × 2
##    species_name                    species_common_name                          
##    <chr>                           <chr>                                        
##  1 Anolis carolinensis             Carolina anole, green anole                  
##  2 Bos taurus                      bovine, cattle, cow, dairy cow, domestic cat…
##  3 Caenorhabditis elegans          <NA>                                         
##  4 Canis lupus familiaris          dog, dogs                                    
##  5 Danio rerio                     leopard danio, zebra danio, zebra fish, zebr…
##  6 Drosophila melanogaster         fruit fly                                    
##  7 Equus caballus                  domestic horse, equine, horse                
##  8 Felis catus                     cat, cats, domestic cat                      
##  9 Gallus gallus                   bantam, chicken, chickens, Gallus domesticus 
## 10 Homo sapiens                    human                                        
## 11 Macaca mulatta                  rhesus macaque, rhesus macaques, Rhesus monk…
## 12 Monodelphis domestica           gray short-tailed opossum                    
## 13 Mus musculus                    house mouse, mouse                           
## 14 Ornithorhynchus anatinus        duck-billed platypus, duckbill platypus, pla…
## 15 Pan troglodytes                 chimpanzee                                   
## 16 Rattus norvegicus               brown rat, Norway rat, rat, rats             
## 17 Saccharomyces cerevisiae        baker's yeast, brewer's yeast, S. cerevisiae 
## 18 Schizosaccharomyces pombe 972h- <NA>                                         
## 19 Sus scrofa                      pig, pigs, swine, wild boar                  
## 20 Xenopus tropicalis              tropical clawed frog, western clawed frog
hs_msigdb_df <- msigdbr(species = "Homo sapiens")
head(hs_msigdb_df)
## # A tibble: 6 × 15
##   gs_cat gs_subcat      gs_name        gene_symbol entrez_gene ensembl_gene   
##   <chr>  <chr>          <chr>          <chr>             <int> <chr>          
## 1 C3     MIR:MIR_Legacy AAACCAC_MIR140 ABCC4             10257 ENSG00000125257
## 2 C3     MIR:MIR_Legacy AAACCAC_MIR140 ABRAXAS2          23172 ENSG00000165660
## 3 C3     MIR:MIR_Legacy AAACCAC_MIR140 ACTN4                81 ENSG00000130402
## 4 C3     MIR:MIR_Legacy AAACCAC_MIR140 ACTN4                81 ENSG00000282844
## 5 C3     MIR:MIR_Legacy AAACCAC_MIR140 ACVR1                90 ENSG00000115170
## 6 C3     MIR:MIR_Legacy AAACCAC_MIR140 ADAM9              8754 ENSG00000168615
## # ℹ 9 more variables: human_gene_symbol <chr>, human_entrez_gene <int>,
## #   human_ensembl_gene <chr>, gs_id <chr>, gs_pmid <chr>, gs_geoid <chr>,
## #   gs_exact_source <chr>, gs_url <chr>, gs_description <chr>
hs_go_df <- hs_msigdb_df %>%
  dplyr::filter(
    gs_cat == "C5", # This is to filter only to the C5 ontology gene sets (according to https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) 
    gs_subcat == "GO:BP" # This is because we want all the GO biological process terms, I can modify this parameter deppending on the GO   I want to analyze, for GO molecular function use GO:MF, for GO celular component use GO:CC
  )

# Step 4: conduct ORA.

gobp_ora_results_c1 <- enricher(
  gene = genes_dif, # A vector of your genes of interest
  pvalueCutoff = 0.1, # Can choose a FDR cutoff
  pAdjustMethod = "BH", # Method to be used for multiple testing correction
  universe = background_set, # A vector containing your background set genes
  # The pathway information should be a data frame with a term name or
  # identifier and the gene identifiers
  TERM2GENE = dplyr::select(
    hs_go_df,
    gs_name,
    human_entrez_gene
  )
)

# Step 5: Visualize / explore

enrich_plot_gobp_c1 <- enrichplot::dotplot(gobp_ora_results_c1)
enrich_plot_gobp_c1

upset_plot_gobp_c1 <- enrichplot::upsetplot(gobp_ora_results_c1)
upset_plot_gobp_c1

It is important also to remember that in our DEG analysis we interrogate three different contrasts of COVID19 samples at different recovery time vs healthy samples. So we can modify our set of genes of interest (defined in step 1 in the previous chuncks) to identify overrepresented pathways/GO in the different contrasts.Previously we used coef=1, meaning we performed our analysis using the DEGs from the contrast Covid19AI-Healthy. Now, lets change coef=2 to perform the ORA using DEGs from the contrast Covid193Mo-Healthy.

# Step 1: determine genes of interest.coef=2 we are looking at the contrast Covid193Mo-Healthy
diff_table2 <- topTable(fit3,coef=2,p.value=0.01,number=10000) 
genes_dif2<- rownames(diff_table2)

#step 2 as before
#Step 3 as before 

# Step 4: conduct ORA KEGG.

kegg_ora_results_c2 <- enricher(
  gene = genes_dif2, # A vector of your genes of interest
  pvalueCutoff = 0.1, # Can choose a FDR cutoff
  pAdjustMethod = "BH", # Method to be used for multiple testing correction
  universe = background_set, # A vector containing your background set genes
  # The pathway information should be a data frame with a term name or
  # identifier and the gene identifiers
  TERM2GENE = dplyr::select(
    hs_kegg_df,
    gs_name,
    human_entrez_gene
  )
)

# Step 4.1: conduct ORA GO:BP.

gobp_ora_results_c2 <- enricher(
  gene = genes_dif2, # A vector of your genes of interest
  pvalueCutoff = 0.1, # Can choose a FDR cutoff
  pAdjustMethod = "BH", # Method to be used for multiple testing correction
  universe = background_set, # A vector containing your background set genes
  # The pathway information should be a data frame with a term name or
  # identifier and the gene identifiers
  TERM2GENE = dplyr::select(
    hs_go_df,
    gs_name,
    human_entrez_gene
  )
)
# Step 5: Visualize / explore

enrich_plot_kegg_c2 <- enrichplot::dotplot(kegg_ora_results_c2)
enrich_plot_kegg_c2

#enrich_plot_gobp_c2 <- enrichplot::dotplot(gobp_ora_results_c2)
#enrich_plot_gobp_c2

#This analysis did not return any significant GO:BP therefore the plot will not work 
#Quitting from lines 350-396 [ORA Covid193Mo-Healthy] (GSE198256_Analysis_Class2.Rmd)
                                                                                                                           
#Error in `ans[ypos] <- rep(yes, length.out = len)[ypos]`:
#! replacement has length zero

We got only one enrich pathway in KEGG and none GO:BP significantly enriched in this contrasts, this could be due to our small input set of genes. let’s see what happen if we try with all degs (p_value <0.05) not only those with p_value <0.01

diff_table2 <- topTable(fit3,coef=2,p.value=0.05,number=10000) 
genes_dif2<- rownames(diff_table2)
# Step 4: conduct ORA KEGG.

kegg_ora_results_c2 <- enricher(
  gene = genes_dif2, # A vector of your genes of interest
  pvalueCutoff = 0.1, # Can choose a FDR cutoff
  pAdjustMethod = "BH", # Method to be used for multiple testing correction
  universe = background_set, # A vector containing your background set genes
  # The pathway information should be a data frame with a term name or
  # identifier and the gene identifiers
  TERM2GENE = dplyr::select(
    hs_kegg_df,
    gs_name,
    human_entrez_gene
  )
)

# Step 4.1: conduct ORA GO:BP.

gobp_ora_results_c2 <- enricher(
  gene = genes_dif2, # A vector of your genes of interest
  pvalueCutoff = 0.1, # Can choose a FDR cutoff
  pAdjustMethod = "BH", # Method to be used for multiple testing correction
  universe = background_set, # A vector containing your background set genes
  # The pathway information should be a data frame with a term name or
  # identifier and the gene identifiers
  TERM2GENE = dplyr::select(
    hs_go_df,
    gs_name,
    human_entrez_gene
  )
)
# Step 5: Visualize / explore

enrich_plot_kegg_c2 <- enrichplot::dotplot(kegg_ora_results_c2)
enrich_plot_kegg_c2

enrich_plot_gobp_c2 <- enrichplot::dotplot(gobp_ora_results_c2)
enrich_plot_gobp_c2

We increased the number of DEGs overrepresented in the KEGG pathway but our p.adj was reduced, and we got some GO:BP enriched in our set but still with a not super significant p_value. In this case,when a ORA is not sufficient to revel biological insights of our data, a good alternative would be to move on the GSEA analysis since is a method that does not depend on the DEGs but in the totality of expressed genes.

Lets conduct GSEA

set.seed(101)

# Step 1: determine genes of interest (all genes expressed)
diff_table_all <- topTable(fit3,coef=1,p.value=1,number=nrow(logCPM)) 

# Step 2: determine background. We don't need a backgroup for GSEA 

# Step 3: Determine gene sets.

#msigdbr_species()
hs_msigdb_df <- msigdbr(species = "Homo sapiens")
#head(hs_msigdb_df, n=3)

hs_kegg_df <- hs_msigdb_df %>%
  dplyr::filter(
    gs_cat == "C2", # This is to filter only to the C2 curated gene sets
    gs_subcat == "CP:KEGG" # This is because we only want KEGG pathways
  )

# Step 4: conduct GSEA
# ordering by paramenter B (we can order by logFC, p-value or any metric we like)
list_ordered <- diff_table_all[,"B"]
names(list_ordered) <- rownames(diff_table_all)
  
  
gsea_results_kegg <- GSEA(
  geneList = list_ordered, # Ordered ranked gene list
  minGSSize = 25, # Minimum gene set size
  maxGSSize = 500, # Maximum gene set set
  pvalueCutoff = 0.05, # p-value cutoff
  eps = 0, # Boundary for calculating the p value
  seed = TRUE, # Set seed to make results reproducible
  pAdjustMethod = "BH", # Benjamini-Hochberg correction
  TERM2GENE = dplyr::select(
    hs_kegg_df,
    gs_name,
    human_entrez_gene
  )
)
## preparing geneSet collections...
## GSEA analysis...
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.01% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : There were 1 pathways for which P-values were not calculated
## properly due to unbalanced (positive and negative) gene-level statistic values.
## For such pathways pval, padj, NES, log2err are set to NA. You can try to
## increase the value of the argument nPermSimple (for example set it nPermSimple
## = 10000)
## leading edge analysis...
## done...
# Step 5: Visualize / explore
#check results
#head(gsea_results_kegg@result)

#previous visualization is not optimal so lets transform it in a data frame for better visualization 
gsea_result_df <- data.frame(gsea_results_kegg@result)
#our top up results are:
gsea_result_df %>%
  # This returns the 3 rows with the largest NES values
  dplyr::slice_max(NES, n = 3)
##                                                                            ID
## KEGG_ALZHEIMERS_DISEASE                               KEGG_ALZHEIMERS_DISEASE
## KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY
## KEGG_LEISHMANIA_INFECTION                           KEGG_LEISHMANIA_INFECTION
##                                                                   Description
## KEGG_ALZHEIMERS_DISEASE                               KEGG_ALZHEIMERS_DISEASE
## KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY
## KEGG_LEISHMANIA_INFECTION                           KEGG_LEISHMANIA_INFECTION
##                                        setSize enrichmentScore       NES
## KEGG_ALZHEIMERS_DISEASE                    139      -0.2168379 -1.442677
## KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY      91      -0.2536014 -1.569666
## KEGG_LEISHMANIA_INFECTION                   63      -0.2735039 -1.572445
##                                             pvalue   p.adjust     qvalue rank
## KEGG_ALZHEIMERS_DISEASE                0.018511178 0.04059469 0.01948545 5666
## KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY 0.008769623 0.02192406 0.01052355 3791
## KEGG_LEISHMANIA_INFECTION              0.014215722 0.03173152 0.01523113 4382
##                                                          leading_edge
## KEGG_ALZHEIMERS_DISEASE                tags=43%, list=35%, signal=29%
## KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY tags=41%, list=23%, signal=31%
## KEGG_LEISHMANIA_INFECTION              tags=38%, list=27%, signal=28%
##                                                                                                                                                                                                                                                                                                                                  core_enrichment
## KEGG_ALZHEIMERS_DISEASE                7385/1347/5664/23621/1329/102/4512/572/4694/4698/23385/513/1537/7388/6622/7124/1349/4714/4697/4700/11261/4722/27089/4716/55851/317/842/9451/51107/489/4708/5594/4723/1327/5533/4707/5331/517/5595/5330/4724/7386/4695/54205/10975/4731/5530/6389/4725/22926/2932/6868/9167/841/1350/514/351/5663/7381/805
## KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY                                                                                                                 7124/4893/6885/11261/4793/3586/8517/23533/5605/5296/3845/3937/2885/387/5294/5594/5533/5291/5595/6654/4690/1326/7409/1739/5788/5530/5970/3725/3551/2932/5777/5295/5170/10451/998/5293/5894
## KEGG_LEISHMANIA_INFECTION                                                                                                                                                                                              4689/10454/7124/6885/7040/4793/3586/2002/3688/6772/5594/3654/5595/7189/5579/4688/3716/5970/2214/3725/5777/23118/3717/1378
# then, lets plot those top up results of our interest

most_positive_nes_plot <- enrichplot::gseaplot(
  gsea_results_kegg,
  geneSetID = "KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY",
  title = "KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY",
  color.line = "#0d76ff"
)

most_positive_nes_plot

#our top down results are:
gsea_result_df %>%
  # Return the 3 rows with the smallest (most negative) NES values
  dplyr::slice_min(NES, n = 3)
##                                                                                ID
## KEGG_SPLICEOSOME                                                 KEGG_SPLICEOSOME
## KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY
## KEGG_INSULIN_SIGNALING_PATHWAY                     KEGG_INSULIN_SIGNALING_PATHWAY
##                                                                       Description
## KEGG_SPLICEOSOME                                                 KEGG_SPLICEOSOME
## KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY
## KEGG_INSULIN_SIGNALING_PATHWAY                     KEGG_INSULIN_SIGNALING_PATHWAY
##                                          setSize enrichmentScore       NES
## KEGG_SPLICEOSOME                             126      -0.4447394 -2.921911
## KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY      54      -0.4463884 -2.472979
## KEGG_INSULIN_SIGNALING_PATHWAY               111      -0.3653569 -2.353010
##                                                pvalue     p.adjust       qvalue
## KEGG_SPLICEOSOME                         3.867506e-15 4.834383e-13 2.320504e-13
## KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY 4.441754e-07 1.110438e-05 5.330104e-06
## KEGG_INSULIN_SIGNALING_PATHWAY           5.831834e-08 2.429931e-06 1.166367e-06
##                                          rank                   leading_edge
## KEGG_SPLICEOSOME                         5285 tags=70%, list=32%, signal=48%
## KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY 4748 tags=57%, list=29%, signal=41%
## KEGG_INSULIN_SIGNALING_PATHWAY           3730 tags=48%, list=23%, signal=37%
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    core_enrichment
## KEGG_SPLICEOSOME                         10286/6625/5093/22916/84991/3304/84321/8559/9410/51691/51503/9939/10569/23451/11157/10189/220988/6432/51690/3303/10285/10465/3192/10713/4116/58517/8683/10992/10946/3310/10772/8896/9416/56259/6631/9092/10450/10262/144983/84844/6635/22985/25804/6427/26121/9343/9128/6628/10523/56949/57461/988/8449/151903/27258/9775/6634/6637/11325/6633/22827/83443/11338/1659/9785/4809/29896/22938/55696/84950/9716/55119/1665/3178/6429/23450/153527/6428/55660/25949/23350/9879/57187/3312/4670/6434/3190/3183
## KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY                                                                                                                                                                                                                                                                                                                                           6352/4210/10454/7184/7124/6885/4793/8517/3320/2920/3606/10392/22900/5594/55914/5595/4671/7189/10910/329/331/5970/257397/64127/3551/841/23118/9051/7128/3326/834
## KEGG_INSULIN_SIGNALING_PATHWAY                                                                                                                                                                                                                                       4893/1978/5257/10603/5573/5565/5567/3643/5584/23533/1398/5605/5296/2002/3845/5106/5834/6720/5571/6198/7248/2885/5294/5594/57521/51763/6199/122809/5291/5595/5499/9021/7249/2475/6654/5256/6009/5566/369/3551/5500/2932/23265/9470/8660/5295/5170/2872/1399/5293/5894/5836/805
# then, lets plot those top negative results of our interest
most_negative_nes_plot <- enrichplot::gseaplot(
  gsea_results_kegg,
  geneSetID = "KEGG_SPLICEOSOME",
  title = "KEGG_SPLICEOSOME",
  color.line = "#0d76ff"
)

most_negative_nes_plot

ACTIVITY 3.2: alternatives to KEGG?

As in the ORA analysis. Lets perform GSEA on GO biological processes. I decide to use it since is one of the most extended and traditional GO enrichment reported in literature. Depending on the biological questions, sometimes, the authors also report GO cellular component, and GO molecular function, but for a global insight the GO biological process in the one most commonly reported in manuscripts. This is the main reason I am limiting my analysis to this ontology. However, we can not discard interesting things in the others.

# Step 1: determine genes of interest (all genes expressed). same as previously

# Step 2: determine background. We don't need a backgroup for GSEA 

# Step 3: Determine gene sets.

hs_msigdb_df <- msigdbr(species = "Homo sapiens")

hs_go_df <- hs_msigdb_df %>%
  dplyr::filter(
    gs_cat == "C5", # This is to filter only to the C5 ontology gene sets (according to https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) 
    gs_subcat == "GO:BP" # This is because we want all the GO biological process terms, I can modify this parameter deppending on the GO   I want to analyze, for GO molecular function use GO:MF, for GO celular component use GO:CC
  )

# Step 4: conduct GSEA
# ordering by paramenter B (we can order by logFC, p-value or any metric we like)
list_ordered <- diff_table_all[,"B"]
names(list_ordered) <- rownames(diff_table_all)
  
  
gsea_results_gobp <- GSEA(
  geneList = list_ordered, # Ordered ranked gene list
  minGSSize = 25, # Minimum gene set size
  maxGSSize = 500, # Maximum gene set set
  pvalueCutoff = 0.05, # p-value cutoff
  eps = 0, # Boundary for calculating the p value
  seed = TRUE, # Set seed to make results reproducible
  pAdjustMethod = "BH", # Benjamini-Hochberg correction
  TERM2GENE = dplyr::select(
    hs_go_df,
    gs_name,
    human_entrez_gene
  )
)
## preparing geneSet collections...
## GSEA analysis...
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.01% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : There were 4 pathways for which P-values were not calculated
## properly due to unbalanced (positive and negative) gene-level statistic values.
## For such pathways pval, padj, NES, log2err are set to NA. You can try to
## increase the value of the argument nPermSimple (for example set it nPermSimple
## = 10000)
## leading edge analysis...
## done...
# Step 5: Visualize / explore
#check results
#head(gsea_results_gobp@result)

#previous visualization is not optimal so lets transform it in a data frame for better visualization 
gsea_result_gobp_df <- data.frame(gsea_results_gobp@result)
#our top up results are:
gsea_result_gobp_df %>%
  # This returns the 3 rows with the largest NES values
  dplyr::slice_max(NES, n = 3)
##                                                                                                          ID
## GOBP_DNA_REPLICATION_DEPENDENT_CHROMATIN_ORGANIZATION GOBP_DNA_REPLICATION_DEPENDENT_CHROMATIN_ORGANIZATION
## GOBP_MICROTUBULE_CYTOSKELETON_ORGANIZATION                       GOBP_MICROTUBULE_CYTOSKELETON_ORGANIZATION
## GOBP_ADAPTIVE_IMMUNE_RESPONSE                                                 GOBP_ADAPTIVE_IMMUNE_RESPONSE
##                                                                                                 Description
## GOBP_DNA_REPLICATION_DEPENDENT_CHROMATIN_ORGANIZATION GOBP_DNA_REPLICATION_DEPENDENT_CHROMATIN_ORGANIZATION
## GOBP_MICROTUBULE_CYTOSKELETON_ORGANIZATION                       GOBP_MICROTUBULE_CYTOSKELETON_ORGANIZATION
## GOBP_ADAPTIVE_IMMUNE_RESPONSE                                                 GOBP_ADAPTIVE_IMMUNE_RESPONSE
##                                                       setSize enrichmentScore
## GOBP_DNA_REPLICATION_DEPENDENT_CHROMATIN_ORGANIZATION      32       0.2500021
## GOBP_MICROTUBULE_CYTOSKELETON_ORGANIZATION                485      -0.1751175
## GOBP_ADAPTIVE_IMMUNE_RESPONSE                             350      -0.1824545
##                                                             NES      pvalue
## GOBP_DNA_REPLICATION_DEPENDENT_CHROMATIN_ORGANIZATION  1.784572 0.011797080
## GOBP_MICROTUBULE_CYTOSKELETON_ORGANIZATION            -1.354272 0.006878453
## GOBP_ADAPTIVE_IMMUNE_RESPONSE                         -1.370761 0.012330278
##                                                         p.adjust     qvalue
## GOBP_DNA_REPLICATION_DEPENDENT_CHROMATIN_ORGANIZATION 0.04077941 0.02457175
## GOBP_MICROTUBULE_CYTOSKELETON_ORGANIZATION            0.02660760 0.01603248
## GOBP_ADAPTIVE_IMMUNE_RESPONSE                         0.04196772 0.02528776
##                                                       rank
## GOBP_DNA_REPLICATION_DEPENDENT_CHROMATIN_ORGANIZATION 6491
## GOBP_MICROTUBULE_CYTOSKELETON_ORGANIZATION            6509
## GOBP_ADAPTIVE_IMMUNE_RESPONSE                         3596
##                                                                         leading_edge
## GOBP_DNA_REPLICATION_DEPENDENT_CHROMATIN_ORGANIZATION tags=75%, list=40%, signal=45%
## GOBP_MICROTUBULE_CYTOSKELETON_ORGANIZATION            tags=47%, list=40%, signal=29%
## GOBP_ADAPTIVE_IMMUNE_RESPONSE                         tags=29%, list=22%, signal=23%
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     core_enrichment
## GOBP_DNA_REPLICATION_DEPENDENT_CHROMATIN_ORGANIZATION                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  8356/8358/8352/8355/8368/8357/8354/8366/121504/8362/8363/8359/8353/8208/8370/8968/8360/554313/8364/8350/8520/8361/10036/8294
## GOBP_MICROTUBULE_CYTOSKELETON_ORGANIZATION            79000/152206/440145/51199/54617/54801/79819/5310/7277/9928/23354/159686/5879/4744/6674/2316/59341/10297/653125/6314/79864/5590/729440/79925/8674/55857/653073/54970/11127/374407/55755/115106/1453/100653515/56907/54908/123811/54919/387885/8086/55812/84071/653075/23172/11200/9493/29098/3996/9055/2801/23332/23639/8243/7517/3304/163786/10013/11337/81930/117178/10615/54461/4134/10039/8312/403/158135/10432/221421/26160/3985/653464/4204/84790/26140/79649/57787/55559/114791/636/4957/10634/1874/64793/221150/6622/23093/84942/26112/55125/79643/114327/6188/672/79441/84223/55201/2011/27436/6904/348654/63979/347240/50810/157922/11261/8874/55172/9820/89796/84142/29899/57704/3303/10464/8556/140609/152185/6760/5300/3192/51143/7430/8882/675/23353/1069/285331/7175/256364/116840/8481/546/9371/3688/9857/7461/9419/23032/10376/387/11186/583/1605/4682/56890/1104/1263/5901/9851/8409/54462/85440/29082/23116/23113/5119/10636/5311/3619/57701/4659/6566/338657/26973/22995/9702/57545/10540/8480/25897/9738/4292/26586/51115/57606/8841/26005/3796/1639/9181/28964/10726/10982/5108/25/79648/10048/23291/1783/4627/1739/1267/23299/51652/85459/80254/9793/10426/2932/5518/10015/7320/1107/10460/3837/3064/7415/6683/57551/26065/4140/25777/25978/9126/9475/7756/324/203068/22919/6093/9748/6249/128866/7514/9611/10735/55704
## GOBP_ADAPTIVE_IMMUNE_RESPONSE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      6885/814/8741/11118/64421/7037/2015/22976/6778/7040/84766/1514/201294/3586/975/149041/6891/6494/10859/79915/93978/57823/25865/200558/3606/5817/11221/91543/2067/84695/604/51752/3136/10459/384/4277/7248/3554/11119/64167/1075/5294/64092/57192/338339/3659/9655/7189/4292/89857/2475/5579/1512/163486/3665/64332/64218/84868/3566/567/5585/3071/7468/5788/202309/3140/1604/10384/2214/64127/440275/11025/23108/54537/695/3106/8832/7454/6868/6556/55183/51111/5777/7030/81622/3107/5027/1445/54542/4683/3570/7128/3146/3717/5293/10673/4179/64005/113/1378/3105
# then, lets plot those top up results of our interest

most_positive_nes_plot_bp <- enrichplot::gseaplot(
  gsea_results_gobp,
  geneSetID = "GOBP_ADAPTIVE_IMMUNE_RESPONSE",
  title = "GOBP_ADAPTIVE_IMMUNE_RESPONSE",
  color.line = "#0d76ff"
)
most_positive_nes_plot_bp

#our top down results are:
gsea_result_gobp_df %>%
  # Return the 3 rows with the smallest (most negative) NES values
  dplyr::slice_min(NES, n = 3)
##                                                ID              Description
## GOBP_SELECTIVE_AUTOPHAGY GOBP_SELECTIVE_AUTOPHAGY GOBP_SELECTIVE_AUTOPHAGY
## GOBP_MACROAUTOPHAGY           GOBP_MACROAUTOPHAGY      GOBP_MACROAUTOPHAGY
## GOBP_MRNA_PROCESSING         GOBP_MRNA_PROCESSING     GOBP_MRNA_PROCESSING
##                          setSize enrichmentScore       NES       pvalue
## GOBP_SELECTIVE_AUTOPHAGY      72      -0.4662090 -2.812911 9.125004e-10
## GOBP_MACROAUTOPHAGY          283      -0.3722641 -2.740254 6.383003e-20
## GOBP_MRNA_PROCESSING         447      -0.3563106 -2.733401 2.467874e-27
##                              p.adjust       qvalue rank
## GOBP_SELECTIVE_AUTOPHAGY 5.559408e-08 3.349837e-08 4427
## GOBP_MACROAUTOPHAGY      4.175136e-17 2.515740e-17 4334
## GOBP_MRNA_PROCESSING     6.014209e-24 3.623878e-24 4951
##                                            leading_edge
## GOBP_SELECTIVE_AUTOPHAGY tags=62%, list=27%, signal=46%
## GOBP_MACROAUTOPHAGY      tags=51%, list=26%, signal=38%
## GOBP_MRNA_PROCESSING     tags=56%, list=30%, signal=40%
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        core_enrichment
## GOBP_SELECTIVE_AUTOPHAGY                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              65992/84971/23130/9821/55187/29979/79876/65018/10075/10241/23376/83667/7416/23647/29110/54543/84749/8408/7157/91452/90678/51368/10392/23192/162427/55626/57674/3964/55072/55054/5595/7249/550/9927/124997/8678/64127/9776/51569/11331/55102/10193/3064/23001/292
## GOBP_MACROAUTOPHAGY                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              8396/84971/23130/9821/90410/25989/55062/55187/60412/27131/79643/6048/29978/29979/79876/6885/1201/65018/10075/5868/192111/10241/23241/5861/11267/23376/8517/83667/83734/4943/60673/26276/7416/643246/23647/29110/54543/8027/22878/84749/10919/8408/535/9550/8897/9342/4734/83933/11345/64112/7157/55763/91452/30849/23673/90678/7248/51368/10392/55014/5305/165324/23192/27183/29082/5119/6829/162427/57533/25851/55626/57674/3964/128077/4534/11152/57192/55072/23367/55054/5595/9559/143187/200576/79065/7249/51160/4077/2475/550/9927/140775/124997/23545/9711/9373/9146/55432/79443/112936/51652/4864/51322/8678/6009/57724/55207/64127/8673/27072/5566/5289/23049/23274/9776/51028/9637/9842/51569/23265/22930/64771/11331/55102/10193/3064/94241/7415/10490/64419/25978/5768/23001/2773/23256/292/5663/533/64422/5049/120892/55737/112574/128866
## GOBP_MRNA_PROCESSING     55131/84321/51163/8559/9410/50628/9904/51691/10657/58509/101954264/9667/199746/57721/708/10432/51503/550643/58490/27257/8732/54960/9967/10283/9939/79706/1207/22984/285672/55544/25862/7737/10569/10898/10978/26835/11319/23451/151987/11157/10284/23028/8241/55702/140890/79760/10189/54496/220988/23210/10659/9238/23064/6432/26528/51690/5511/51341/23398/11193/9588/22889/83759/4928/10285/11218/5433/23435/101954267/57466/10465/104/23543/51634/54623/3192/58506/8882/5935/79171/10713/4116/58517/55094/8683/10992/84081/5411/90826/8220/9513/103/29894/10946/56254/10772/8896/5930/6294/51574/55599/29890/9416/9541/8189/3297/6631/60625/84872/79005/101954277/55225/123169/8731/201626/9092/54439/10450/57661/55051/27238/10212/3191/57805/10262/144983/84844/6635/8563/22985/8161/25804/79753/6607/6829/6427/8458/167153/26121/9733/2332/11129/9343/23070/9128/8106/6628/6827/51428/56949/57461/125950/494115/23076/51593/55015/60493/9406/988/6606/101954276/11168/8449/84967/29101/27258/7936/55285/9775/83989/55796/6634/57703/9295/10921/11171/8621/51755/94104/6637/55596/3189/6633/87178/79882/55149/22827/55197/83443/10179/55234/11338/11051/1659/7072/9785/4809/4904/8087/8899/8570/55280/64062/23248/29896/22938/27336/54890/10236/55696/84950/5566/9716/55119/7884/1665/3550/10147/9169/1660/22794/23517/51747/3178/6429/23450/153527/170506/55421/25962/6428/2926/79811/55660/23091/5978/25949/351/10768/22913/79577/22803/9879/57187/3312/9589/4670/6434/3190/3183/9991
# then, lets plot those top negative results of our interest
most_negative_nes_plot_bp <- enrichplot::gseaplot(
  gsea_results_gobp,
  geneSetID = "GOBP_SELECTIVE_AUTOPHAGY",
  title = "GOBP_SELECTIVE_AUTOPHAGYE",
  color.line = "#0d76ff"
)
most_negative_nes_plot_bp

Interpretation The previous analysis provided the enriched KEGG patwhays and GO:BP for our ranked genes, The normalized enrichment score (NES) is the primary statistic for examining gene set enrichment results. By normalizing the enrichment score, GSEA accounts for differences in gene set size and in correlations between gene sets and the expression dataset; therefore, the normalized enrichment scores (NES) can be used to compare analysis results across gene sets(GSEA manual). Based on the previuos definition,let’s start by examining first, the results from KEGG pathways. The NES tell us how over or down represent is a gene set respect to the ranked list. In our case, we look into the top 3 smallest NES and the top bigest NES. In this way our most enriched pathways signatures were: KEGG_ALZHEIMERS_DISEASE, KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY and KEGG_LEISHMANIA_INFECTION while the less enriched signatures are: KEGG_SPLICEOSOME, KEGG_INSULIN_SIGNALING_PATHWAY and KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY. But in any case all our NES are negative, meaning that the KEGG pathways will be mostly at the bottom of our ranker list, since we ranked our list by the B statistic (log odds) and considering that in our results from limma-voom the adjusted p values ranks the genes in the same order as B we can assume that in our list the genes are ranked from the most to the least significant.So the KEGG pathways genes are mostly represented in our less significant genes (in other words less DE) but for example KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY is represented in genes that are more significant (higher B) and KEGG_SPLICEOSOME in genes less significant (lower B than for KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY) but in any case all of them are at the bottom. So I would consider this pathways are not specially representative of the most contrasting genes (lets say DEGs) but of other genes showing slighly changes in expression across the conditions.This could be a reflection of subtle changes in the expression of many genes that in the end could represent an impact at a biological level. If we move now to the GO:BP we have a positive NES for the DNA_REPLICATION_DEPENDENT_CHROMATIN_ORGANIZATION indicating that genes in this set will be mostly represented at the top of our ranked list (higher B score, most significant genes). In contrast, all the other NES are negative meaning that all the other enriched pathways are mostly represented at the bottom of the ranked list (low B score, less significant genes). To me, this result could make sense since we are analysing samples derived from patients that have been exposed to a COVID19 infection, so the virus can induce a response in genes related to DNA replication and chromosomal stability ( most significant) in comparison to non infected controls since the cells of of this patients are dealing with the RNA replication of the virus that can disturb the natural processes of replication, cell repairing and cell division check points.

ACTIVITY 3.3: compare GSEA vs ORA?

ORA methods differ from GSEA because they only consider the query gene set of interest and need a strict cutoff to classify genes, also the results can be strongly affected by the statistical test and the visualization techniques used (Chicco & Agapito,2022). Despite these differences in our data set we can observed some similar terms/pathways enriched using both approaches as is the case of the terms/pathways related to spliceosome and immune response. So to me, this tools could be used as complementary methods, for instance in a ORA analysis since the focus is on DEGs the processes identified with the negative NES of the GSEA most probably will not appear since are very small and statistical not-significant changes in gene expression, however, the accumulation of minor signal could be masking complex biological processes, worth of attention. In addition, the use of this tools also depends a lot in the biological question that the researcher want to adress and in how much knows about the phenomenon of interest. For a exploratory process I would go with a GSEA analysis, but for more well-structured hypothesis probably and ORA could reveal the relevant information, of course if the set of genes to analyze was careful and appropriately selected and is a reflection of the biological process in study.

ACTIVITY 3.4: running GeneSetCluster

GeneSetCluster is a novel approach which allows clustering of identified gene-sets, from one or multiple experiments and/or tools, based on shared genes. GeneSetCluster calculates a distance score based on overlapping gene content, which is then used to cluster them together and as a result, GeneSetCluster identifies groups of gene-sets with similar gene-set definitions (i.e. gene content). These groups of gene-sets can aid the researcher to focus on such groups for biological interpretations(Ewing et al, 2020)

library(GeneSetCluster)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## Warning: replacing previous import 'AnnotationDbi::select' by
## 'clusterProfiler::select' when loading 'GeneSetCluster'
## Warning: replacing previous import 'ComplexHeatmap::%v%' by 'network::%v%' when
## loading 'GeneSetCluster'
## 
## Warning: replacing previous import 'cowplot::align_plots' by
## 'patchwork::align_plots' when loading 'GeneSetCluster'
## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading
## 'GeneSetCluster'
## Warning: replacing previous import 'dplyr::filter' by 'stats::filter' when
## loading 'GeneSetCluster'
## Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when
## loading 'GeneSetCluster'
## Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head' when
## loading 'GeneSetCluster'
path = "/Users/cardonky/Library/CloudStorage/GoogleDrive-kelly.cardonalondono@kaust.edu.sa/My Drive/My_Macbook/PhD-Classes/UnderstandingBioinfoPipelines/Class1"

GSEA.files <- paste0(path, "/", list.files(path, pattern = ".csv"))

# Load the data and create Pathway object
# Automatically for GSEA, GREAT or IPA
GSEA.Object1 <- LoadGeneSets(file_location = GSEA.files, 
                              groupnames= c("GSEA_Hvs6Mo", "GSEA_HvsAI"), # names of the groups
                              P.cutoff = 0.05, # cut off the p.adjust
                              Mol.cutoff = 15, # minimum number of genes per pathway
                              Source = "GSEA", # the analysis (GSEA, GREAT or IPA)
                              structure = "ENTREZID", # Gene type (SYMBOL, ENTREZID, ENSEMBLID)
                              Organism = "org.Hs.eg.db", # database: Homo Sapiens or Mus musculus
                              seperator = "/") # the separator used for listing genes
## [=========================================================]
## [<<<<            LoadGeneSets START                  >>>>>]
## -----------------------------------------------------------
## Loading data from /Users/cardonky/Library/CloudStorage/GoogleDrive-kelly.cardonalondono@kaust.edu.sa/My Drive/My_Macbook/PhD-Classes/UnderstandingBioinfoPipelines/Class1/GSEA_Hvs6Mo.csv
## Loading data from /Users/cardonky/Library/CloudStorage/GoogleDrive-kelly.cardonalondono@kaust.edu.sa/My Drive/My_Macbook/PhD-Classes/UnderstandingBioinfoPipelines/Class1/GSEA_HvsAI.csv
## [=========================================================]
## [<<<<            ObjectCreator START                 >>>>>]
## -----------------------------------------------------------
## -----------------------------------------------------------
## [<<<<<               ObjectCreator END              >>>>>>]
## [=========================================================]
## [You may want to process CombineGeneSets next.            ]
## [or merge objects using MergeObjects                     ]
## [or select certain types from objects using ManageGeneSets]
## -----------------------------------------------------------
## [<<<<<               LoadGeneSets END               >>>>>>]
## [=========================================================]
## [You may want to process CombineGeneSets next.            ]
## [or merge objects using MergeObjects                     ]
## [or select certain types from objects using ManageGeneSets]
GSEA.Object2 <- CombineGeneSets(Object = GSEA.Object1,
                                combineMethod = "Standard", threads = 8)
## [=========================================================]
## [<<<<            CombineGeneSets START               >>>>>]
## -----------------------------------------------------------
## Combining experiments
## preparing condensed display
## calulating RR
## Parallelizing processes, 8 cores have been selected.
## -----------------------------------------------------------
## [<<<<<             CombineGeneSets END              >>>>>>]
## [=========================================================]
## [You may want to process ClusterGeneSets next.            ]
OptimalGeneSets(Object = GSEA.Object2, 
                uniquePathway = FALSE, # consider all the pathways (also repeated) or the unique pathways
                method = "silhouette", max_cluster= 24, cluster_method = "kmeans", main= "Kmeans for 24 clusters")
## [=========================================================]
## [<<<<            OptimalGeneSets START               >>>>>]
## -----------------------------------------------------------
## Finding optimal number of clusters
## Clustering method =  kmeans
## Optimizing method =  silhouette
## -----------------------------------------------------------
## [<<<<<             OptimalGeneSets END              >>>>>>]
## [=========================================================]
## [You may want to process ClusterGeneSets next.            ]

OptimalGeneSets(Object = GSEA.Object2, 
                uniquePathway = TRUE, # consider all the pathways (also repeated) or the unique pathways
                method = "silhouette", max_cluster= 24, cluster_method = "kmeans", main= "Kmeans for 24 clusters")
## [=========================================================]
## [<<<<            OptimalGeneSets START               >>>>>]
## -----------------------------------------------------------
## Finding optimal number of clusters
## Clustering method =  kmeans
## Optimizing method =  silhouette
## -----------------------------------------------------------
## [<<<<<             OptimalGeneSets END              >>>>>>]
## [=========================================================]
## [You may want to process ClusterGeneSets next.            ]

# in both cases the optimal cluster is 2

GSEA.Object3 <- ClusterGeneSets(Object = GSEA.Object2, 
                                clusters = 2, # consider all the pathways (also repeated) or the unique pathways
                                method = "Hierarchical", # Hierarchical clustering or kmeans
                                order = "cluster",
                                molecular.signature = "All")
## [=========================================================]
## [<<<<            ClusterGeneSets START               >>>>>]
## -----------------------------------------------------------
## Using all Gene sets
## Running Hierarchical
## Ordering pathway clusters
## -----------------------------------------------------------
## [<<<<<             ClusterGeneSets END              >>>>>>]
## [=========================================================]
## [You may want to process HighlightGeneSets next.            ]
## [You may want to plot the results using PlotGeneSets next. ]
# plot results for both all pathways and unique pathways
plotnounique <- PlotGeneSets(GSEA.Object3, 
                             uniquePathways = FALSE, 
                             wordcloud = FALSE, # wordcloud only supported for GO terms, semantic enrichment
                             doORA = T) # do ora per cluster
## Performing ORA for 2 clusters and choosing the most overrepresented pathway...
## No legend element is put in the last 6 rows under `nrow = 8`, maybe you
## should set `by_row = FALSE`? Reset `nrow` to 2.

plotunique <- PlotGeneSets(GSEA.Object3, 
                           uniquePathways = TRUE, 
                           wordcloud = FALSE, # wordcloud only supported for GO terms
                           doORA = T) # do ora per cluster
## Performing ORA for 2 clusters and choosing the most overrepresented pathway...
## No legend element is put in the last 6 rows under `nrow = 8`, maybe you
## should set `by_row = FALSE`? Reset `nrow` to 2.

# let's say we are interested in exploring cluster 2 in plotunique. Lets break up this cluste for further analysis 

plotoptimalcluster2 <- OptimalGeneSets(Object = GSEA.Object3, 
                uniquePathway = TRUE, # consider all the pathways (also repeated) or the unique pathways
                cluster = 2, # which cluster
                method = "silhouette", max_cluster= 24, cluster_method = "kmeans", main= "Kmeans for 24 clusters in cluster 1")
## [=========================================================]
## [<<<<            OptimalGeneSets START               >>>>>]
## -----------------------------------------------------------
## Finding optimal number of clusters
## Clustering method =  kmeans
## Optimizing method =  silhouette
## -----------------------------------------------------------
## [<<<<<             OptimalGeneSets END              >>>>>>]
## [=========================================================]
## [You may want to process ClusterGeneSets next.            ]

plotoptimalcluster2 # optimal 2 break up cluster 2 in 2 clusters

GSEA.Object3breakup <- BreakUpCluster(GSEA.Object3, 
                                      breakup.cluster = 2, # which cluster
                                      sub.cluster = 2, # in how many cluster split up
                                      uniquePathways = TRUE) # conside unique pathways
## [=========================================================]
## [<<<<            BreakUpCluster START                >>>>>]
## -----------------------------------------------------------
## Loading required package: RColorBrewer
## Warning in BreakUpCluster(GSEA.Object3, breakup.cluster = 2, sub.cluster = 2, :
## Currently only working for Kmeans and Hierarchical
## Ordering pathway clusters
## -----------------------------------------------------------
## [<<<<<             BreakUpCluster END               >>>>>>]
## [=========================================================]
## [You may want to process HighlightGeneSets next.           ]
## [You may want to plot the results using PlotGeneSets next. ]
plotuniquebreakup <- PlotGeneSets(GSEA.Object3breakup, 
                                  uniquePathways = TRUE, 
                                  wordcloud = FALSE, # wordcloud only supported for GO terms
                                  doORA = T) # do ora per cluster
## Performing ORA for 3 clusters and choosing the most overrepresented pathway...
## No legend element is put in the last 5 rows under `nrow = 8`, maybe you
## should set `by_row = FALSE`? Reset `nrow` to 3.

plotuniquebreakup

# Now break up the cluster 1 
plotoptimalcluster1 <- OptimalGeneSets(Object = GSEA.Object3, 
                uniquePathway = TRUE, # consider all the pathways (also repeated) or the unique pathways
                cluster = 1, # which cluster
                method = "silhouette", max_cluster= 24, cluster_method = "kmeans", main= "Kmeans for 24 clusters in cluster 1")
## [=========================================================]
## [<<<<            OptimalGeneSets START               >>>>>]
## -----------------------------------------------------------
## Finding optimal number of clusters
## Clustering method =  kmeans
## Optimizing method =  silhouette
## -----------------------------------------------------------
## [<<<<<             OptimalGeneSets END              >>>>>>]
## [=========================================================]
## [You may want to process ClusterGeneSets next.            ]

plotoptimalcluster1 # optimal 1 break up cluster 1 in 9 clusters

GSEA.Object3breakup2 <- BreakUpCluster(GSEA.Object3breakup, 
                                      breakup.cluster = 1, # which cluster
                                      sub.cluster = 9, # in how many cluster split up
                                      uniquePathways = TRUE) # conside unique pathways
## [=========================================================]
## [<<<<            BreakUpCluster START                >>>>>]
## -----------------------------------------------------------
## Warning in BreakUpCluster(GSEA.Object3breakup, breakup.cluster = 1, sub.cluster
## = 9, : Currently only working for Kmeans and Hierarchical
## Ordering pathway clusters
## -----------------------------------------------------------
## [<<<<<             BreakUpCluster END               >>>>>>]
## [=========================================================]
## [You may want to process HighlightGeneSets next.           ]
## [You may want to plot the results using PlotGeneSets next. ]
plotuniquebreakup2 <- PlotGeneSets(GSEA.Object3breakup2, 
                                   uniquePathways = TRUE, 
                                   wordcloud = FALSE, # wordcloud only supported for GO terms
                                   doORA = T) # do ora per cluster
## Performing ORA for 11 clusters and choosing the most overrepresented pathway...

plotuniquebreakup2

This tool provides too many visualization and information, so let’s explore the results.

# plot results for both all pathways and unique pathways
plotnounique <- PlotGeneSets(GSEA.Object3, 
                             uniquePathways = FALSE, 
                             wordcloud = FALSE, # wordcloud only supported for GO terms
                             doORA = T) # do ora per cluster
## Performing ORA for 2 clusters and choosing the most overrepresented pathway...
## No legend element is put in the last 6 rows under `nrow = 8`, maybe you
## should set `by_row = FALSE`? Reset `nrow` to 2.

plotunique <- PlotGeneSets(GSEA.Object3, 
                           uniquePathways = TRUE, 
                           wordcloud = FALSE, # wordcloud only supported for GO terms
                           doORA = T) # do ora per cluster
## Performing ORA for 2 clusters and choosing the most overrepresented pathway...
## No legend element is put in the last 6 rows under `nrow = 8`, maybe you
## should set `by_row = FALSE`? Reset `nrow` to 2.

let’s say we are interested in exploring cluster 2 in plotunique. Lets break up this cluster for further analysis

plotoptimalcluster2 <- OptimalGeneSets(Object = GSEA.Object3, 
                uniquePathway = TRUE, # consider all the pathways (also repeated) or the unique pathways
                cluster = 2, # which cluster
                method = "silhouette", max_cluster= 24, cluster_method = "kmeans", main= "Kmeans for 24 clusters in cluster 1")
## [=========================================================]
## [<<<<            OptimalGeneSets START               >>>>>]
## -----------------------------------------------------------
## Finding optimal number of clusters
## Clustering method =  kmeans
## Optimizing method =  silhouette
## -----------------------------------------------------------
## [<<<<<             OptimalGeneSets END              >>>>>>]
## [=========================================================]
## [You may want to process ClusterGeneSets next.            ]

plotoptimalcluster2 # optimal 2 break up cluster 2 in 2 clusters

GSEA.Object3breakup <- BreakUpCluster(GSEA.Object3, 
                                      breakup.cluster = 2, # which cluster
                                      sub.cluster = 2, # in how many cluster split up
                                      uniquePathways = TRUE) # conside unique pathways
## [=========================================================]
## [<<<<            BreakUpCluster START                >>>>>]
## -----------------------------------------------------------
## Warning in BreakUpCluster(GSEA.Object3, breakup.cluster = 2, sub.cluster = 2, :
## Currently only working for Kmeans and Hierarchical
## Ordering pathway clusters
## -----------------------------------------------------------
## [<<<<<             BreakUpCluster END               >>>>>>]
## [=========================================================]
## [You may want to process HighlightGeneSets next.           ]
## [You may want to plot the results using PlotGeneSets next. ]
plotuniquebreakup <- PlotGeneSets(GSEA.Object3breakup, 
                                  uniquePathways = TRUE, 
                                  wordcloud = FALSE, # wordcloud only supported for GO terms
                                  doORA = T) # do ora per cluster
## Performing ORA for 3 clusters and choosing the most overrepresented pathway...
## No legend element is put in the last 5 rows under `nrow = 8`, maybe you
## should set `by_row = FALSE`? Reset `nrow` to 3.

plotuniquebreakup

Now break up the cluster 1

plotoptimalcluster1 <- OptimalGeneSets(Object = GSEA.Object3, 
                uniquePathway = TRUE, # consider all the pathways (also repeated) or the unique pathways
                cluster = 1, # which cluster
                method = "silhouette", max_cluster= 24, cluster_method = "kmeans", main= "Kmeans for 24 clusters in cluster 1")
## [=========================================================]
## [<<<<            OptimalGeneSets START               >>>>>]
## -----------------------------------------------------------
## Finding optimal number of clusters
## Clustering method =  kmeans
## Optimizing method =  silhouette
## -----------------------------------------------------------
## [<<<<<             OptimalGeneSets END              >>>>>>]
## [=========================================================]
## [You may want to process ClusterGeneSets next.            ]

plotoptimalcluster1 # optimal 1 break up cluster 1 in 9 clusters

GSEA.Object3breakup2 <- BreakUpCluster(GSEA.Object3breakup, 
                                      breakup.cluster = 1, # which cluster
                                      sub.cluster = 9, # in how many cluster split up
                                      uniquePathways = TRUE) # conside unique pathways
## [=========================================================]
## [<<<<            BreakUpCluster START                >>>>>]
## -----------------------------------------------------------
## Warning in BreakUpCluster(GSEA.Object3breakup, breakup.cluster = 1, sub.cluster
## = 9, : Currently only working for Kmeans and Hierarchical
## Ordering pathway clusters
## -----------------------------------------------------------
## [<<<<<             BreakUpCluster END               >>>>>>]
## [=========================================================]
## [You may want to process HighlightGeneSets next.           ]
## [You may want to plot the results using PlotGeneSets next. ]
plotuniquebreakup2 <- PlotGeneSets(GSEA.Object3breakup2, 
                                   uniquePathways = TRUE, 
                                   wordcloud = FALSE, # wordcloud only supported for GO terms
                                   doORA = T) # do ora per cluster
## Performing ORA for 11 clusters and choosing the most overrepresented pathway...

plotuniquebreakup2

GeneSetCluster address the challenges of interpreting large gene-set analysis outcomes, or when integration of data from different results is combined is a promising tool but futher refinement of the vignette on the interpretation of the results would be very helpful for the user since the package offers many visualization options and is hard to understand how, why and when use each one.

REFERENCES

  1. Law, C.W., Chen, Y., Shi, W. et al. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol 15, R29 (2014). https://doi.org/10.1186/gb-2014-15-2-r29

2.. Law CW, Zeglinski K, Dong X, Alhamdoosh M, Smyth GK, Ritchie ME. A guide to creating design matrices for gene expression experiments. F1000Res. 2020 Dec 10;9:1444. doi: 10.12688/f1000research.27893.1. PMID: 33604029; PMCID: PMC7873980.

  1. Tong, Y. (2021). The comparison of limma and DESeq2 in gene analysis. In E3S Web of Conferences (Vol. 271, p. 03058). EDP Sciences.

  2. https://hbctraining.github.io/Training-modules/planning_successful_rnaseq/slides/functional_analysis_mp.pdf

  3. Chicco, D., & Agapito, G. (2022). Nine quick tips for pathway enrichment analysis. PLoS computational biology, 18(8), e1010348.

  4. Ewing, E., Planell-Picola, N., Jagodic, M., & Gomez-Cabrero, D. (2020). GeneSetCluster: a tool for summarizing and integrating gene-set analysis results. Bmc Bioinformatics, 21(1), 1-7.